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1 Motivation 



Consider a collection of N agents interested in estimating the same parameter vector, w°, of size M x 1. The 
vector is the minimizer of some global cost function, denoted by J glob (w;), which the agents seek to optimize, 
say, 



argmm 



j glob H 



(i) 



We are interested in situations where the individual agents have access to partial information about the global 
cost function. In this case, cooperation among the agents becomes beneficial. For example, by cooperating 
with their neighbors, and by having these neighbors cooperate with their neighbors, procedures can be 
devised that would enable all agents in a network to converge close to the global optimum w° through local 
interactions. The objective of decentralized processing is to allow spatially distributed agents to achieve a 
global objective by relying solely on local information and on in-network processing. Through a continuous 
process of cooperation and information sharing with neighbors, agents in a network can be made to approach 
the global performance level despite the localized nature of their interactions. 



1.1 Networks and Neighborhoods 

In this article we focus mainly on connected networks, although many of the results hold even if the network 
graph is separated into disjoint subgraphs. In a connected network, if we pick any two arbitrary nodes, 
then there will exist at least one path connecting them: the nodes may be connected directly by an edge 
if they are neighbors, or they may be connected by a path that passes through other intermediate nodes. 
Figure 1 provides a graphical representation of a connected network with TV = 10 nodes. Nodes that are 
able to share information with each other are connected by edges. The sharing of information over these 
edges can be unidirectional or bi-directional. The neighborhood of any particular node is defined as the set 
of nodes that are connected to it by edges; we include in this set the node itself. The figure illustrates the 
neighborhood of node 3, which consists of the following subset of nodes: A/3 = {1,2,3,5}. For each node, 
the size of its neighborhood defines its degree. For example, node 3 in the figure has degree IA/3I = 4, while 
node 8 has degree |A/g| = 5. Nodes that are well connected have higher degrees. Note that we are denoting 
the neighborhood of an arbitrary node k by A4 and its size by | A4 1 . We shall also use the notation nk to 
refer to |A4|- 

The neighborhood of any node k therefore consists of all nodes with which node k can exchange informa- 
tion. We assume a symmetric situation in relation to neighbors so that if node k is a neighbor of node £, then 
node I is also a neighbor of node k. This does not necessarily mean that the flow of information between 
these two nodes is symmetric. For instance, in future sections, we shall assign pairs of nonnegative weights 
to each edge connecting two neighboring nodes — see Fig. 2. In particular, we will assign the coefficient aik 
to denote the weight used by node k to scale the data it receives from node £; this scaling can be interpreted 
as a measure of trustworthiness or reliability that node k assigns to its interaction with node I. Note that we 
are using two subscripts, £k, with the first subscript denoting the source node (where information originates 
from) and the second subscript denoting the sink node (where information moves to) so that: 



aik = ag _y k (information flowing from node I to node k) 



(2) 



In this way, the alternative coefficient ake will denote the weight used to scale the data sent from node k to 
i: 



ciki — ak -v i (information flowing from node k to node €) 



(3) 



The weights {aki,a£k} can be different, and one or both of them can be zero, so that the exchange of 
information over the edge connecting the neighboring nodes (fc, i) need not be symmetric. When one of the 
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neighborhood 
of node 3 



Figure 1: A network consists of a collection of cooperating nodes. Nodes that are linked by edges can share 
information. The neighborhood of any particular node consists of all nodes that are connected to it by edges 
(including the node itself). The figure illustrates the neighborhood of node 3, which consists of nodes {1,2,3,5}. 
Accordingly, node 3 has degree 4, which is the size of its neighborhood. 



weights is zero, say, dkt = 0, then this situation means that even though nodes (k,£) are neighbors, node 
I is either not receiving data from node k or the data emanating from node k is being annihilated before 
reaching node £. Likewise, when akk > 0, then node k scales its own data, whereas akk = corresponds 
to the situation when node k does not use its own data. Usually, in graphical representations like those in 
Fig. 1, edges are drawn between neighboring nodes that can share information. And, it is understood that 
the actual sharing of information is controlled by the values of the scaling weights that are assigned to the 
edge; these values can turn off communication in one or both directions and they can also scale one direction 
more heavily than the reverse direction, and so forth. 

1.2 Cooperation Among Agents 

Now, depending on the application under consideration, the solution vector w° from (1) may admit different 
interpretations. For example, the entries of w° may represent the location coordinates of a nutrition source 
that the agents are trying to hnd, or the location of an accident involving a dangerous chemical leak. 
The nodes may also be interested in locating a predator and tracking its movements over time. In these 
localization applications, the vector w° is usually two or three-dimensional. In other applications, the entries 
of ic° may represent the parameters of some model that the network wishes to learn, such as identifying 
the model parameters of a biological process or the occupied frequency bands in a shared communications 
medium. There are also situations where different agents in the network may be interested in estimating 
different entries of w°, or even different parameter vectors w° altogether, say, {u>£} for node k, albeit with 
some relation among the different vectors so that cooperation among the nodes can still be rewarding. In 
this article, however, we focus exclusively on the special (yet frequent and important) case where all agents 
are interested in estimating the same parameter vector w° . 

Since the agents have a common objective, it is natural to expect cooperation among them to be beneficial 
in general. One important question is therefore how to develop cooperation strategies that can lead to 
better performance than when each agent attempts to solve the optimization problem individually. Another 
important question is how to develop strategies that endow networks with the ability to adapt and learn in 
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Figure 2: In the top part, and for emphasis purposes, we are representing the edge between nodes k and I by two 
separate directed links: one moving from k to I and the other moving from i to k. Two nonnegative weights are used 
to scale the sharing of information over these directed links. The scalar a^t denotes the weight used to scale data 
sent from node k to £, while atk denotes the weight used to scale data sent from node i to k. The weights {ak,e, aek} 
can be different, and one or both of them can be zero, so that the exchange of information over the edge connecting 
any two neighboring nodes need not be symmetric. The bottom part of the figure illustrates directed links arriving 
to node fc from its neighbors {^1,^2,^3, ■ ■ ■} (left) and leaving from node k towards these same neighbors (right). 



real-time in response to changes in the statistical properties of the data. This article provides an overview 
of results in the area of diffusion adaptation with illustrative examples. Diffusion strategies are powerful 
methods that enable adaptive learning and cooperation over networks. There have been other useful works 
in the literature on the use of alternative consensus strategies to develop distributed optimization solutions 
over networks. Nevertheless, we explain in App. E why diffusion strategies outperform consensus strategies 
in terms of their mean-square-error stability and performance. For this reason, we focus in the body of the 
chapter on presenting the theoretical foundations of diffusion strategies and their performance. 

1.3 Notation 

In our treatment, we need to distinguish between random variables and deterministic quantities. For this 
reason, we use boldface letters to represent random variables and normal font to represent deterministic 
(non-random) quantities. For example, the boldface letter d denotes a random quantity, while the normal 
font letter d denotes an observation or realization for it. We also need to distinguish between matrices and 
vectors. For this purpose, use CAPITAL letters to refer to matrices and small letters to refer to both vectors 
and scalars; Greek letters always refer to scalars. For example, we write R to denote a covariance matrix 
and w to denote a vector of parameters. We also write to refer to the variance of a random variable. To 
distinguish between a vector d (small letter) and a scalar d (also a small letter) , we use parentheses to index 
scalar quantities and subscripts to index vector quantities. Thus, we write d(i) to refer to the value of a 
scalar quantity d at time i, and di to refer to the value of a vector quantity d at time i. Thus, d(i) denotes 
a scalar while di denotes a vector. All vectors in our presentation are column vectors, with the exception 
of the regression vector (denoted by the letter u), which will be taken to be a row vector for convenience 
of presentation. The symbol T denotes transposition, and the symbol * denotes complex conjugation for 
scalars and complex-conjugate transposition for matrices. The notation col{a, b} denotes a column vector 
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with entries a and b stacked on top of each other, and the notation diagja, 6} denotes a diagonal matrix 
with entries a and b. Likewise, the notation vec(A) vectorizes its matrix argument and stacks the columns 
of A on top of each other. The notation ||a;|| denotes the Euclidean norm of its vector argument, while 
||^||h,oo denotes the block maximum norm of a block vector (defined in App. D). Similarly, the notation 
||x||f; denotes the weighted square value, x*YjX. Moreover, ||j4||b j0O denotes the block maximum norm of a 
matrix (also defined in App. D), and p(A) denotes the spectral radius of the matrix (i.e., the largest absolute 
magnitude among its eigenvalues). Finally, Im denotes the identity matrix of size M x M; sometimes, for 
simplicity of notation, we drop the subscript M from Im when the size of the identity matrix is obvious from 
the context. Table 1 provides a summary of the symbols used in the article for ease of reference. 

Table 1: Summary of notation conventions used in the article. 



d Boldface notation denotes random variables. 

d Normal font denotes realizations of random variables. 

A Capital letters denote matrices. 

a Small letters denote vectors or scalars. 

a Greek letters denote scalars. 

d(i) Small letters with parenthesis denote scalars. 

di Small letters with subscripts denote vectors. 

T Matrix transposition. 

* Complex conjugation for scalars and complex-conjugate transposition for matrices. 

col{a, b} Column vector with entries a and b. 

diag{a, b} Diagonal matrix with entries a and b. 

vec(A) Vectorizes A and stacks its columns on top of each other. 
Euclidean norm of its vector argument. 

Hxlll Weighted square value x*Ylx. 

||a;||t>,cx> Block maximum norm of a block vector - see App. D. 

||-A||i>,oo Block maximum norm of a matrix - see App. D. 

||A|| 2— induced norm of matrix A (its largest singular value). 

p(A) Spectral radius of its matrix argument. 

Im Identity matrix of size M x M; sometimes, we drop the subscript M. 



2 Mean-Square-Error Estimation 

Readers interested in the development of the distributed optimization strategies and their adaptive versions 
can move directly to Sec. 3. The purpose of the current section is to motivate the virtues of distributed 
in-network processing, and to provide illustrative examples in the context of mean-square-error estimation. 
Advanced readers can skip this section on a first reading. 

We start our development by associating with each agent k an individual cost (or utility) function, 
Jk{w). Although the algorithms presented in this article apply to more general situations, we nevertheless 
assume in our presentation that the cost functions Jk{w) are strictly convex so that each one of them has 
a unique minimizer. We further assume that, for all costs Jk{w), the minimum occurs at the same value 
w° . Obviously, the choice of Jk(w) is limitless and is largely dependent on the application. It is sufficient 
for our purposes to illustrate the main concepts underlying diffusion adaptation by focusing on the case of 
mean-square-error (MSE) or quadratic cost functions. In the sequel, we provide several examples to illustrate 
how such quadratic cost functions arise in applications and how cooperative processing over networks can 
be beneficial. At the same time, we note that most of the arguments in this article can be extended beyond 
MSE optimization to more general cost functions — as already shown in [1-3]; see also Sec. 10.4 for a brief 
summary. 

In non-cooperative solutions, each agent would operate individually on its own cost function Jk(w) and 
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optimize it to determines w°, without any interaction with the other nodes. However, the analysis and 
derivations in future sections will reveal that nodes can benefit from cooperation between them in terms of 
better performance (such as converging faster to w° or tracking a changing w° more effectively) — see, e.g., 
Theorems 6.3-6.5 and Sec. 7.3. 

2.1 Application: Autoregressive Modeling 

Our first example relates to identifying the parameters of an auto-regressive (AR) model from noisy data. 
Thus, consider a situation where agents are spread over some geographical region and each agent k is 
observing realizations {dk{i)} of an AR zero- mean random process {dfc(i)}, which satisfies a model of the 
form: 

M 

dk{i) = ^2 f3 m d k (i - m) + v k (i) (4) 

m— 1 

The scalars {j3 m } represent the model parameters that the agents wish to identify, and v k (i) represents an 
additive zero-mean white noise process with power: 

< fc t E KM I 2 (5) 

It is customary to assume that the noise process is temporally white and spatially independent so that noise 
terms across different nodes are independent of each other and, at the same node, successive noise samples 
are also independent of each other with a time- independent variance a% fc : 

J Ev k (i)vl(j) = 0, for all i ^ j (temporal whiteness) 

[ Ev k (i)v* n (j) = 0, for alii, j whenever k ^ m (spatial whiteness) 

The noise process Vk(i) is further assumed to be independent of past signals {d&{i — m),m > 1} across all 
nodes I. Observe that we are allowing the noise power profile, erj; k , to vary with k. In this way, the quality 
of the measurements is allowed to vary across the network with some nodes collecting noisier data than other 
nodes. Figure 3 illustrates an example of a network consisting of N — 10 nodes spread over a region in space. 
The figure shows the neighborhood of node 2, which consists of nodes {1, 2, 3}. 



Linear Model 

To illustrate the difference between cooperative and non-cooperative estimation strategics, let us first explain 
that the data can be represented in terms of a linear model. To do so, we collect the model parameters {(3 m } 
into an M x 1 column vector w°: 

w° = col {ft, p M } (7) 
and the past data into a 1 x M row regression vector u^^\ 

u Kl = [ d k {i - 1) d k (i-2) ... dfc(i-M)] (8) 
Then, we can rewrite the measurement equation (4) at each node k in the equivalent linear model form: 



dfc(i) = u kii w° + v k {i) 



(9) 



Linear relations of the form (9) are common in applications and arise in many other contexts (as further 
illustrated by the next three examples in this section). 

We assume the stochastic processes {d k (i), u k ,i} in (9) have zero means and are jointly wide-sense sta- 
tionary. We denote their second-order moments by: 



2 

a d.k 



Ru.k 



Tdu,k 



A 
A 



E\d k (l)\ 2 

Eu* k 



(scalar) 
(M x M) 



Ed k (i)u* ki (M x 1) 



(10) 

(11) 
(12) 
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Figure 3: A collection of nodes, spread over a geographic region, observes realizations of an AR random process 
and cooperates to estimate the underlying model parameters {Pm} in the presence of measurement noise. The noise 
power profile can vary over space. 



As was the case with the noise power profile, we are allowing the moments {a^ k , R u ,k, fdu,k} to depend 
on the node index k so that these moments can vary with the spatial dimension as well. The covariancc 
matrix R u .k is, by definition, always non-negative definite. However, for convenience of presentation, we 
shall assume that R u ,k is actually positive-definite (and, hence, invertible): 

Ru,k > (13) 



N on- Cooperative Mean- Square- Error Solution 

One immediate result that follows from the linear model (9) is that the unknown parameter vector w° can 
be recovered exactly by each individual node from knowledge of the local moments {rd u ,k, Ru,k} alone. To 
see this, note that if we multiply both sides of (9) by u* k i and take expectations we obtain 




(14) 



so that 



Tdu,k = Ru,k W° 



^u,k r du,k 



(15) 



It is seen from (15) that w° is the solution to a linear system of equations and that this solution can be 
computed by every node directly from its moments {R u ,ki fdu.k}- It is useful to re-interpret construction 
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(15) as the solution to a minimum mean-square-error (MMSE) estimation problem [4,5]. Indeed, it can 
be verified that the quantity R~ k rdu,k that appears in (15) is the unique solution to the following MMSE 
problem: 



min E \d k (i) - u ktl w\ 2 



(16) 



To verify this claim, we denote the cost function that appears in (16) by 



J k (w) = E\d k (i)-u k ^w\ 2 (17) 

and expand it to find that 

J k (w) = (J 2 d. k ~w*r du ^ k ~r* duk w + w*R u , k w (18) 

The cost function J k (w) is quadratic in w and it has a unique minimizer since R u _ k > 0. Differentiating 
J k (w) with respect to w we find its gradient vector: 

W w J(w) = {R u ,kW - r d u,k)* (19) 

It is seen that this gradient vector is annihilated at the same value w = w° given by (15). Therefore, we can 
cquivalcntly state that if each node k solves the MMSE problem (16), then the solution vector coincides with 
the desired parameter vector w° . This observation explains why it is often justified to consider mean-square- 
error cost functions when dealing with estimation problems that involve data that satisfy linear models 
similar to (9). 

Besides w°, the solution of the MMSE problem (16) also conveys information about the noise level in 
the data. Note that by substituting w° from (15) into expression (16) for J k (w), the resulting minimum 
mean-square-error value that is attained by node k is found to be: 

MSE fc = J k (w°) 

= E\d k (i) - u kil w°\ 2 

E\v k {i)\ 2 

<u (20) 

^Ai.min 



(0) 



We shall use the notation J k (w°) and Jfc jm in interchangeably to denote the minimum cost value of J k {w). 
The above result states that, when each agent k recovers w° from knowledge of its moments {R u>k , r du .k} 
using expression (15), then the agent attains an MSE performance level that is equal to the noise power at 
its location, a 2 k . An alternative useful expression for the minimum cost can be obtained by substituting 
expression (15) for w° into (18) and simplifying the expression to find that 

MSE fe - a;,. r;,„, .n.,\r,„ j, (21) 
This second expression is in terms of the data moments {a 2 d k , r du k , R Uyk }. 



Non- Cooperative Adaptive Solution 

The optimal MMSE implementation (15) for determining w° requires knowledge of the statistical information 
{fdu,k, Ru,k}- This information is usually not available beforehand. Instead, the agents are more likely to have 
access to successive time-indexed observations {d k (i),u ki i} of the random processes {d k (i), u kt {\ for i > 0. 
In this case, it becomes necessary to devise a scheme that would allow each node to use its measurements 
to approximate w°. It turns out that an adaptive solution is possible. In this alternative implementation, 
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each node k feeds its observations {dk(i),u k ,i} hito an adaptive filter and evaluates successive estimates for 
w°. As time passes by, the estimates would get closer to w°. 

The adaptive solution operates as follows. Let Wk,i denote an estimate for w° that is computed by node 
k at time i based on all the observations {d k (j),u k j , j < i} it has collected up to that time instant. There 
are many adaptive algorithms that can be used to compute w k ^; some filters are more accurate than others 
(usually, at the cost of additional complexity) [4-7] . It is sufficient for our purposes to consider one simple 
(yet effective) filter structure, while noting that most of the discussion in this article can be extended to 
other structures. One of the simplest choices for an adaptive structure is the least-mean-squares (LMS) 
filter, where the data are processed by each node k as follows: 

efe(i) = dk{i) - u k ,iW k ,i-i (22) 
Wk,% = Wk,i-i + Hku* kl e k (i), i>0 (23) 

Starting from some initial condition, say, Wk t -i = 0, the filter iterates over i > 0. At each time instant, i, 
the filter uses the local data {d k (i), u ki i} at node k to compute a local estimation error, efc(i), via (22). The 
error is then used to update the existing estimate from Wk^—i to Wk,i via (23). The factor fi k that appears in 
(23) is a constant positive step-size parameter; usually chosen to be sufficiently small to ensure mean-square 
stability and convergence, as discussed further ahead in the article. The step-size parameter can be selected 
to vary with time as well; one popular choice is to replace jjLk in (23) with the following construction: 

= Ml'" ||2 ( 24 ) 

where e > is a small positive parameter and fXk > 0. The resulting filter implementation is known as 
normalized LMS [5] since the step-size is normalized by the squared norm of the regression vector. Other 
choices include step-size sequences > 0} that satisfy both conditions: 



E 

i=0 



= oo, ^^/i 2 (i)<oo (25) 



Such sequences converge slowly towards zero. One example is the choice Hk(i) = ^rj. However, by virtue of 
the fact that such step-sizes die out as i — > oo, then these choices end up turning off adaptation. As such, 
step-size sequences satisfying (25) are not generally suitable for applications that require continuous learning, 
especially under non-stationary environments. For this reason, in this article, we shall focus exclusively on 
the constant step-size case (23) in order to ensure continuous adaptation and learning. 

Equations (22)-(23) are written in terms of the observed quantities {d k (i),u kt i}; these are deterministic 
values since they correspond to observations of the random processes Wfc,j}. Often, when we are 

interested in highlighting the random nature of the quantities involved in the adaptation step, especially 
when we study the mean-square performance of adaptive filters, it becomes more useful to rewrite the 
recursions using the boldface notation to highlight the fact that the quantities that appear in (22)-(23) are 
actually realizations of random variables. Thus, we also write: 

e k (i) = dk(i) - u k ,iWk,i-i (26) 
w k ,i = w k ,i-i+ fMkul^ekii), i>0 (27) 

where {e k (i),w k ^} will be random variables as well. 

The performance of adaptive implementations of this kind are well-understood for both cases of stationary 
io° and changing w° [4-7]. For example, in the stationary case, if the adaptive implementation (26)-(27) 
were to succeed in having its estimator Wk,i tend to w° with probability one as i — > oo, then we would 
expect the error signal e k {i) in (26) to tend towards the noise signal v k {i) (by virtue of the linear model (9)). 
This means that, under this ideal scenario, the variance of the error signal e k (i) would be expected to tend 
towards the noise variance, k , as i — > oo. Recall from (20) that the noise variance is the least cost that 
the MSE solution can attain. Therefore, such limiting behavior by the adaptive filter would be desirable. 



9 



However, it is well-known that there is always some loss in mean-square-error performance when adaptation 
is employed due to the effect of gradient noise, which is caused by the algorithm's reliance on observations 
(or realizations) {d k (i),u k _i} rather than on the actual moments {rd u ,k, Ru,k}- In particular, it is known 
that for LMS filters, the variance of e k (i) in steady-state will always be larger than a 2 k by a small amount, 
and the size of the offset is proportional to the step-size parameter /j, k (so that smaller step-sizes lead to 
better mean-square-error (MSE) performance albeit at the expense of slower convergence). It is easy to see 
why the variance of e k (i) will be larger than a 2 , k from the definition of the error signal in (26). Introduce 
the weight-error vector 



A o 



and the so-called a-priori error signal 



e a ,k{i) = Uk,iWk,i-i 



(28) 



(29) 



This second error measures the difference between the uncorrupted term u k ^w° and its estimator prior to 
adaptation, Ufc^tOfc,*— i- It then follows from the data model (9) and from the defining expression (26) for 
e k (i) that 



e k (i) = d k (i) - u k ,iW k ,i--L 

= u ka w° + v k (i) - u kii w ki i_i 
= e 0ifc (i) + v k (i) 



(30) 



Since the noise component, v k (i), is assumed to be zero-mean and independent of all other random variables, 
we conclude that 



E\e k (i)\ 2 = E\e a , k {i) 



'v,k 



(31) 



This relation holds for any time instant i; it shows that the variance of the output error, e k (i), is always 
larger than a 2 k by an amount that is equal to the variance of the a-priori error, e a , k (i). We define the filter 
mean-square-error (MSE) and excess-mean-square-error (EMSE) as the following steady-state measures: 



MSE fe = limEle^)! 2 

i— >oo 

EMSE fe = lim E\e a , k (i)\ 2 

i— >oo 

so that, for the adaptive implementation (compare with (20)): 



MSE fc = EMSEfe + a 



v.k 



(32) 
(33) 

(34) 



Therefore, the EMSE term quantifies the size of the offset in the MSE performance of the adaptive filter. 
We also define the filter mean-square-deviation (MSD) as the steady-state measure: 



MSDfe = lim E H53fe.il 



(35) 



which measures how far w k i is from w° in the mean-square-error sense in steady-state. It is known that the 
MSD and EMSE of LMS filters of the form (26)-(27) can be approximated for sufficiently small-step sizes 
by the following expressions [4-7] : 



EMSE fc « /ifccr2 fe Tr( J R„,fc)/2 
MSDfe a /^fcAf/2 



(36) 
(37) 
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It is seen that the smaller the step-size parameter, the better the performance of the adaptive solution. 



Cooperative Adaptation through Diffusion 

Observe from (36)-(37) that even if all nodes employ the same step-size, \ik = Mi an d even if the regression 
data are spatially uniform so that R Ui k = R u for all k, the mean-square-error performance across the nodes 
still varies in accordance with the variation of the noise power profile, a% k , across the network. Nodes with 
larger noise power will perform worse than nodes with smaller noise power. However, since all nodes are 
observing data arising from the same underlying model w°, it is natural to expect cooperation among the 
nodes to be beneficial. By cooperation we mean that neighboring nodes can share information (such as 
measurements or estimates) with each other as permitted by the network topology. Starting in the next 
section, we will motivate and describe algorithms that enable nodes to carry out adaptation and learning in 
a cooperative manner to enhance performance. 

Specifically, we are going to see that one way to achieve cooperation is by developing adaptive algorithms 
that enable the nodes to optimize the following global cost function in a distributed manner: 




(38) 



where the global cost is the aggregate objective: 

N 



N 



(39) 



k=l 



k=l 



Comparing (38) with (16), we see that we are now adding the individual costs, Jfc(iu), from across all nodes. 
Note that since the desired w° satisfies (15) at every node k, then it also satisfies 



^ u > k 



N 

r dUyk 

fc=l 



(40) 



But it can be verified that the optimal solution to (38) is given by the same w° that satisfies (40). Therefore, 
solving the global optimization problem (38) also leads to the desired w°. In future sections, wc will show 
how cooperative and distributed adaptive schemes for solving (38), such as (153) or (154) further ahead, lead 
to improved performance in estimating w° (in terms of smaller mean-square-deviation and faster convergence 
rate) than the non-cooperative mode (26)~(27), where each agent runs its own individual adaptive filter — 
see, e.g., Theorems 6.3-6.5 and Sec. 7.3. 



2.2 Application: Tapped-Delay-Line Models 

Our second example to motivate MSE cost functions, Jfc(u>), and linear models relates to identifying the 
parameters of a moving-average (MA) model from noisy data. MA models are also known as finite-impulse- 
response (FIR) or tapped-delay-line models. Thus, consider a situation where agents are interested in 
estimating the parameters of an FIR model, such as the taps of a communications channel or the parameters 
of some (approximate) model of interest in finance or biology. Assume the agents are able to independently 
probe the unknown model and observe its response to excitations in the presence of additive noise; this 
situation is illustrated in Fig. 4, with the probing operation highlighted for one of the nodes (node 4). 

The schematics inside the enlarged diagram in Fig. 4 is meant to convey that each node k probes the 
model with an input sequence {«&(«)} and measures the resulting response sequence, {dk(i)\, in the presence 



11 



Figure 4: The network is interested in estimating the parameter vector w° that describes an underlying tapped- 
delay-line model. The agents are assumed to be able to independently probe the unknown system, and observe its 
response to excitations, under noise, as indicated in the highlighted diagram for node 4. 



of additive noise. The system dynamics for each agent k is assumed to be described by a MA model of the 
form: 

M-l 



d k{i) = ^2 PmUk(i - m) + v k (i) 



(41) 



rn=0 



In this model, the term v k (i) again represents an additive zero- mean noise process that is assumed to be 
temporally white and spatially independent; it is also assumed to be independent of the input process, 
{ug(j)}, for all i, j and £. The scalars {f3 m } represent the model parameters that the agents seek to identify. 
If we again collect the model parameters into an M x 1 column vector w°: 



W° = COI {/? , /?!,..., pu-l} 

and the input data into a 1 X M row regression vector: 



u k {i) Uk(i-1) 



u k {i-M + 1) ] 



(42) 



(43) 



then, we can again express the measurement equation (41) at each node k in the same linear model as (9), 
namely, 



d k (i) = u k ,iW° + v k (i) 



(44) 



As was the case with model (9), we can likewise verify that, in view of (44), the desired parameter vector 
w° satisfies the same normal equations (15), i.e., 



rdu,k = RuM w° 



w ° = R vl r du,k 



(45) 



where the moments {rdu,k-, Ru,k} continue to be defined by expressions (11)— (12) with u k i now defined by 
(43). Therefore, each node k can determine w° on its own by solving the same MMSE estimation problem 
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(16). This solution method requires knowledge of the moments {rd u ,k, Ru,k} and, according to (20), each 
agent k would then attain an MSE level that is equal to the noise power level at its location. 

Alternatively, when the statistical information {rdu.k, Ru,k} is not available, each agent k can estimate 
io° iteratively by feeding data {d k (i), u k ^} into the adaptive implementation (26)-(27). In this way, each 
agent k will achieve the same performance level shown earlier in (36)-(37), with the limiting performance 
being again dependent on the local noise power level, a 2 k . Therefore, nodes with larger noise power will 
perform worse than nodes with smaller noise power. However, since all nodes are observing data arising from 
the same underlying model w°, it is natural to expect cooperation among the nodes to be beneficial. As we 
are going to see, starting from the next section, one way to achieve cooperation and improve performance 
is by developing algorithms that optimize the same global cost function (38) in an adaptive and distributed 
manner, such as algorithms (153) and (154) further ahead. 



2.3 Application: Target Localization 

Our third example relates to the problem of locating a destination of interest (such as the location of a 
nutrition source or a chemical leak) or locating and tracking an object of interest (such as a predator or a 
projectile). In several such localization applications, the agents in the network are allowed to move towards 
the target or away from it, in which case we would end up with a mobile adaptive network [8]. Biological 
networks behave in this manner such as networks representing fish schools, bird formations, bee swarms, 
bacteria motility, and diffusing particles [8-12]. The agents may move towards the target (e.g., when it is a 
nutrition source) or away from the target (e.g., when it is a predator). In other applications, the agents may 
remain static and are simply interested in locating a target or tracking it (such as tracking a projectile). 

To motivate mean-square-error estimation in the context of localization problems, we start with the 
situation corresponding to a static target and static nodes. Thus, assume that the unknown location of the 
target in the cartesian plane is represented by the 2x1 vector w° = colja; , y°}. The agents are spread over 
the same region of space and are interested in locating the target. The location of every agent k is denoted 
by the 2x1 vector p k = col{xk, yk} in terms of its x and y coordinates - see Fig. 5. We assume the agents 
are aware of their location vectors. The distance between agent k and the target is denoted by r£ and is 
equal to: 

r° k = IK (46) 

The 1x2 unit-norm direction vector pointing from agent k towards the target is denoted by u k and is given 
by: 

k IK-Pfe|| v ; 

Observe from (46) and (47) that r k can be expressed in the following useful inner-product form: 

r% = u° k {w°-p k ) (48) 

In practice, agents have noisy observations of both their distance and direction vector towards the target. 
We denote the noisy distance measurement collected by node k at time i by: 

r k (i) = r° k + v k {i) (49) 

where v k {i) denotes noise and is assumed to be zero- mean, and temporally white and spatially independent 
with variance 

< fc ^ (50) 

We also denote the noisy direction vector that is measured by node k at time i by u k i . This vector is a 
perturbed version of u k . We assume that u kt i continues to start from the location of the node at p k , but 
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Figure 5: The distance from node k to the target is denoted by r£ and the unit-norm direction vector from the same 
node to the target is denoted by u%. Node k is assumed to have access to noisy measurements of {r%, u%}. 



that its tip is perturbed slightly either to the left or to the right relative to the tip of u k — see Fig. 6. The 
perturbation to the tip of u k is modeled as being the result of two effects: a small deviation that occurs 
along the direction that is perpendicular to u k , and a smaller deviation that occurs along the direction of 
u k . Since we are assuming that the tip of u kt i is only slightly perturbed relative to the tip of u k , then it is 
reasonable to expect the amount of perturbation along the parallel direction to be small compared to the 
amount of perturbation along the perpendicular direction. 
Thus, we write 

u kli = u% + a k (i) u k x + (3 k (i) u° k (1 x 2) (51) 

where u°^~ denotes a unit-norm row vector that lies in the same plane and whose direction is perpendicular to 
u k . The variables a k (i) and /3 k (i) denote zero-mean independent random noises that are temporally white 
and spatially independent with variances: 

< fc £ E|a fe (i)| 2 , a\ k ± E|/3 fc (z)| 2 (52) 

We assume the contribution of P k (i) is small compared to the contributions of the other noise sources, oc k (i) 
and v k (i), so that 

The random noises {v k (i), a k (i), /3 k (i)} are further assumed to be independent of each other. 

Using (48) we find that the noisy measurements {r k {i), u k} i] are related to the unknown w° via: 

r k (i) = u k)i (w - p k ) + z k {i) (54) 

where the modified noise term z k (i) is defined in terms of the noises in r k {i) and u kt i as follows: 

z k (i) = v k (i) - a k (i) u° k x ■ (w° - Pk ) - /3 k (i) ■ u° k ■ (w° - p fc ) 
= v k (i) - P k (i)-ti 

w v k (i) (55) 
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Figure 6: The tip of the noisy direction vector is modeled as being approximately perturbed away from the actual 
direction by two effects: a larger effect caused by a deviation along the direction that is perpendicular to u k , and a 
smaller deviation along the direction that is parallel to u k . 



since, by construction, 

u° k ± -(w°-p k ) = (56) 
and the contribution by (3 k (i) is assumed to be sufficiently small. If we now introduce the adjusted signal: 



dk(i) = r k (i) + u k ,i p k 



(57) 



then we arrive again from (54) and (55) at the following linear model for the available measurement variables 
{e£fc(i), u k _i} in terms of the target location w°: 



d k (i) w u k jw° + v k (i) 



(58) 



There is one important difference in relation to the earlier linear models (9) and (44), namely, the variables 
{d k (i), u k i} in (58) do not have zero means any longer. It is nevertheless straightforward to determine the 
first and second-order moments of the variables {d k (i),u ki i}. First, note from (49), (51), and (57) that 



Eufci = ul 



Edfc(i) = r° k + u° kPk 



(59) 



Even in this case of non-zero means, and in view of (58), the desired parameter vector w° can still be shown 
to satisfy the same normal equations (15), i.e., 



rdu, k = Ru, k w° w° = R u ^ k r du ,k 

where the moments {rdu, k , Ru, k } continue to be defined as 

R Uik = Eu* ki u kA , r dUtk = Eu* ki d k (i) 



(60) 



(61) 



To verify that (60) holds, we simply multiply both sides of (58) by u k i from the left, compute the expectations 
of both sides, and use the fact that v k (i) has zero mean and is assumed to be independent of for 
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all times j and nodes £. However, the difference in relation to the earlier normal equations (15) is that the 
matrix R u _ k is not the actual covariance matrix of u kj i any longer. When Uk,i is not zero mean, its covariance 
matrix is instead defined as: 



cov Mjfe = E(u kt i-u%)*(uk,i-u%) 

E* O* O 

u k,iUk,i - u k u k 



so that 



cov Uife + u° k *u° k 



(62) 



(63) 



We conclude from this relation that R Uyk is positive-definite (and, hence, invertible) so that expression (60) 
is justified. This is because the covariance matrix, cov„ is itself positive-definite. Indeed, some algebra 
applied to the difference u k i — u k from (51) shows that 



where the matrix 



,o± 



(64) 



(65) 



is full rank since the rows {u' 



o_L 



III 



} are linearly independent vectors. 



Therefore, each node k can determine w° on its own by solving the same minimum mean-square-error 
estimation problem (16). This solution method requires knowledge of the moments {rdu,k, Ru,k} and, ac- 
cording to (20), each agent k would then attain an MSE level that is equal to the noise power level, u\ k , at 
its location. 

Alternatively, when the statistical information {rd u ,k, Ru,k} is not available beforehand, each agent k can 
estimate w° iterativcly by feeding data {d k (i), u ky i} into the adaptive implementation (26)-(27). In this case, 
each agent k will achieve the performance level shown earlier in (36)-(37), with the limiting performance 
being again dependent on the local noise power level, k . Therefore, nodes with larger noise power will 
perform worse than nodes with smaller noise power. However, since all nodes are observing distances and 
direction vectors towards the same target location w°, it is natural to expect cooperation among the nodes 
to be beneficial. As we are going to see, starting from the next section, one way to achieve cooperation and 
improve performance is by developing algorithms that solve the same global cost function (38) in an adaptive 
and distributed manner, by using algorithms such as (153) and (154) further ahead. 

Role of Adaptation 

The localization application helps highlight one of the main advantages of adaptation, namely, the ability 
of adaptive implementations to learn and track changing statistical conditions. For example, in the context 
of mobile networks, where nodes can move closer or further away from a target, the location vector for each 
agent k becomes time-dependent, say, p k% i = col{x k (i),y k (i)}. In this case, the actual distance and direction 
vector between agent k and the target also vary with time and become: 



r °k{i) = \\w°-Pk,ih 



The noisy distance measurement to the target is then: 



1 k.i 



\\w°-Pk,i\\ 



r k {i) = r° k (i) + v k (i) 
where the variance of v k (i) now depends on time as well: 

A 



(66) 



(67) 



<Vfc(*) 
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In the context of mobile networks, it is reasonable to assume that the variance of v k (i) varies both with time 
and with the distance to the target: the closer the node is to the target, the less noisy the measurement of 
the distance is expected to be. Similar remarks hold for the variances of the noises a k (i) and /3 fe (i) that 
perturb the measurement of the direction vector, say, 

< fc (i) ^ E|a*(i)| 2 , o* ik (i) ± E\f3 k (i)\ 2 (69) 

where now 

u k ,i = u° k , t + a h (i)u%$ + /3 fc (i)<i (70) 

The same arguments that led to (58) can be repeated to lead to the same model, except that now the means 
of the variables {dfc(i), ttfc,j} become time-dependent as well: 

E« M =<i. EdfcW = r° k (i) + u° ktiPk ,i (71) 

Nevertheless, adaptive solutions (whether cooperative or non-cooperative), are able to track such time- 
variations because these solutions work directly with the observations {d k {i), u k: i} and the successive ob- 
servations will reflect the changing statistical profile of the data. In general, adaptive solutions are able to 
track changes in the underlying signal statistics rather well [4,5], as long as the rate of non-stationarity is 
slow enough for the filter to be able to follow the changes. 



2.4 Application: Collaborative Spectral Sensing 

Our fourth and last example to illustrate the role of mean-square-error estimation and cooperation relates 
to spectrum sensing for cognitive radio applications. Cognitive radio systems involve two types of users: 
primary users and secondary users. To avoid causing harmful interference to incumbent primary users, 
unlicensed cognitive radio devices need to detect unused frequency bands even at low signal-to-noise (SNR) 
conditions [13-16] . One way to carry out spectral sensing is for each secondary user to estimate the aggregated 
power spectrum that is transmitted by all active primary users, and to locate unused frequency bands within 
the estimated spectrum. This step can be performed by the secondary users with or without cooperation. 

Thus, consider a communications environment consisting of Q primary users and N secondary users. Let 
S q (e 3U1 ) denote the power spectrum of the signal transmitted by primary user q. To facilitate estimation 
of the spectral profile by the secondary users, we assume that each SL(e ? ' w ) can be represented as a linear 
combination of basis functions, {/m(e-'")}, say, B of them [17]: 

B 

S *(* U ) = E/W™(e J "), q=l,2,...,Q (72) 

m— 1 

In this representation, the scalars {f3 qm } denote the coefficients of the basis expansion for user q. The variable 
uj 6 [— 7r, 7r] denotes the normalized angular frequency measured in radians/sample. The power spectrum is 
often symmetric about the vertical axis, u = 0, and therefore it is sufficient to focus on the interval lo G [0, tt]. 
There are many ways by which the basis functions, {f m {e JUJ )}, can be selected. The following is one possible 
construction for illustration purposes. We divide the interval [0,7r] into B identical intervals and denote 
their center frequencies by {w m }. We then place a Gaussian pulse at each location u> m and control its width 
through the selection of its standard deviation, er m , i.e., 

f m (en = e'^t 1 - (73) 

Figure 7 illustrates this construction. The parameters {w m , a m } are selected by the designer and are assumed 
to be known. For a sufficiently large number, B, of basis functions, the representation (72) can approximate 
well a large class of power spectra. 
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Figure 7: The interval [0, n] is divided into B sub-intervals of equal width; the center frequencies of the sub-intervals 
are denoted by {u; m }. A power spectrum S q {e 3U ) is approximated as a linear combination of Gaussian basis functions 
centered on the {uj m }. 



We collect the combination coefficients {(3 qm } for primary user q into a column vector w q : 

w q = col{p ql ,p q2 ,/3 q3 ,...,p qB } (Bxl) (74) 

and collect the basis functions into a row vector: 

L=[h(en / a (e**) ... Me>")] (1 X B) (75) 

Then, the power spectrum (72) can be expressed in the alternative inner-product form: 

S q (e ju ) = f u w q (76) 

Let p q k denote the path loss coefficient from primary user q to secondary user k. When the transmitted 
spectrum S q {e Jul ) travels from primary user q to secondary user k, the spectrum that is sensed by node k 
is PijkSq^e^). We assume in this example that the path loss factors {p q k} are known and that they have 
been determined during a prior training stage involving each of the primary users with each of the secondary 
users. The training is usually repeated at regular intervals of time to accommodate the fact that the path 
loss coefficients can vary (albeit slowly) over time. Figure 8 depicts a cognitive radio system with 2 primary 
users and 10 secondary users. One of the secondary users (user 5) is highlighted and the path loss coefficients 
from the primary users to its location are indicated; similar path loss coefficients can be assigned to all other 
combinations involving primary and secondary users. 

Each user fc senses the aggregate effect of the power spectra that are transmitted by all active primary 
users. Therefore, adding the effect of all primary users, we find that the power spectrum that arrives at 
secondary user k is given by: 

Q 

Sk(e ju ) = J2p^ S ^ + °l 

9=1 

Q 

= ^PqkfuW q + a\ 
q=l 

= u Ka w° + 4 (77) 
where a\ denotes the receiver noise power at node k, and where we introduced the following vector quantities: 

w° = co1{w 1 ,w 2 ,...,wq} (BQxl) (78) 

Uk,u = [ Plkfu P2kfuj ■ ■ ■ PQkfu ] (1 x BQ) (79) 



18 



O secondary user (SU) 
% primary user (PU) 



Figure 8: A network of secondary users in the presence of two primary users. One of the secondary users is highlighted 
and the path loss coefficients from the primary users to its location are indicated as pis and P25- 



The vector w° is the collection of all combination coefficients for all Q primary users. The vector Uk,u 
contains the path loss coefficients from all primary users to user fc. Now, at every time instant i, user k 
observers its received power spectrum, Sk{e° u ') 1 over a discrete grid of frequencies, {w r }, in the interval [0, w] 
in the presence of additive measurement noise. We denote these measurements by: 



dk,r(i) = u k ,«, r w° + a\ + v k , r (i) 
r=l,2,...,R 



(80) 



The term Vk <r {i) denotes sampling noise and is assumed to have zero mean and variance cr% k ; it is also 
assumed to be temporally white and spatially independent; and is also independent of all other random 
variables. Since the row vectors Ufc ;W in (79) arc defined in terms of the path loss coefficients {p q k}, and since 
these coefficients are estimated and subject to noisy distortions, we model the Uk,u r as zero-mean random 
variables in (80) and use the boldface notation for them. 

Observe that in this application, each node k collects R measurements at every time instant i and not 
only a single measurement, as was the case with the three examples discussed in the previous sections 
(AR modeling, MA modeling, and localization). The implication of this fact is that we now deal with an 
estimation problem that involves vector measurements instead of scalar measurements at each node. The 
solution structure continues to be the same. We collect the R measurements at node k at time i into vectors 
and introduce the R x 1 quantities: 



dk 



A 



dk,i(i) 
dkaii) 



d 



■h,R 



v k, 



Vk,2(i) 



(81) 
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and the regression matrix: 



U kli = 



(R x QB) 



(82) 



The time subscript in Uk,i is used to model the fact that the path loss coefficients can change over time 
due to the possibility of node mobility. With the above notation, expression (80) is equivalent to the linear 
model: 



dk,i = U k ,iW° + Vk,i 



(83) 



Compared to the earlier examples (9), (44), and (58), the main difference now is that each agent k collects 
a vector of measurements, df-j, as opposed to the scalar d k {i), and its regression data are represented by 
the matrix quantity, U k ,i, as opposed to the row vector Uk,i- Nevertheless, the estimation approach will 
continue to be the same. In cognitive network applications, the secondary users are interested in estimating 
the aggregate power spectrum of the primary users in order for the secondary users to identify vacant 
frequency bands that can be used by them. In the context of model (83), this amounts to determining the 
parameter vector w° since knowledge of its entries allows each secondary user to reconstruct the aggregate 
power spectrum defined by: 

Q 

A 



q=l 



= (1^/ U K (84) 
denotes the Kronccker product operation, and In denotes a Q x 1 vector whose entries 



where the notation i 
are all equal to one. 

As before, we can again verify that, in view of (83), the desired parameter vector w° satisfies the same 
normal equations: 



RdU,k = Rumw° <^= 
where the moments {Rdu,k, Ru,k} are now defined by 



RdU,k 

Ru,k 



w° — R Uk Rdu,k 

{QB x I) 
{QB x QB) 



(85) 

(86) 
(87) 



Therefore, each secondary user k can determine w° on its own by solving the following minimum mean- 
squarc-error estimation problem: 



min E \\dk,i - U k ,iw\Y 



(88) 



This solution method requires knowledge of the moments {Rdu.k, Ru.k} and, in an argument similar to the 
one that led to (20), it can be verified that each agent k would attain an MSE performance level that is 
equal to the noise power level, a\ k , at its location. 

Alternatively, when the statistical information {Rdu,k, Ru,k} is not available, each secondary user k can 
estimate w° iteratively by feeding data {dk t i,Uk t i} into an adaptive implementation similar to (26)-(27), 
such as the following vector LMS recursion: 



ek,i = dk,i — U k ,iWk,i-i 
w k ,i = Wk,i-i + HkU% ie k . 



(89) 
(90) 



In this case, each secondary user k will achieve the same performance levels shown earlier in (36)-(37) with 
Ru,k replaced by Ru,k- The performance will again be dependent on the local noise level, k . As a result, 
secondary users with larger noise power will perform worse than secondary users with smaller noise power. 
However, since all secondary users are observing data arising from the same underlying model w°, it is 
natural to expect cooperation among the users to be beneficial. As we are going to see, starting from the 
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next section, one way to achieve cooperation and improve performance is by developing algorithms that solve 
the following global cost function in an adaptive and distributed manner: 



N 



min \ " :•: \\dk,i - U kyi w\\' 

w J 



(91) 



k=l 



3 Distributed Optimization via Diffusion Strategies 

The examples in the previous section were meant to illustrate how MSE cost functions and linear models are 
useful design tools and how they arise frequently in applications. We now return to problem (1) and study 
the distributed optimization of global cost functions such as (39), where J gIob (w) is assumed to consist of 
the sum of individual components. Specifically, we are now interested in solving optimization problems of 
the type: 

N 

(92) 



i >| Jk (w) 



fc=i 



where each Jk(w) is assumed to be differentiable and convex over w. Although the algorithms presented in 
this article apply to more general situations, we shall nevertheless focus on mean-square-error cost functions 
of the form: 



Jk(w) = E|dfc(i) - Uk,iW^ 



(93) 



where w is an M x 1 column vector, and the random processes {d k {i), Ufe,i} are assumed to be jointly 
wide-sense stationary with zero-mean and second-order moments: 



rr 2 A 

a d,k = 



Ru, k = Eu* kti u kti > (M x M) 
r du ,k = ^d k (i)u* k l (M x 1) 
It is clear that each J k (w) is quadratic in w since, after expansion, we get 



Jk{w) = a\ k - w*r du ,k ~ r* duk w + w* R u ,k w 



(94) 
(95) 
(96) 



(97) 



A completion-of-squarcs argument shows that Jk{w) can be expressed as the sum of two squared terms, i.e., 
JkH = (^lk-r* du .kK^du,k) + (w-w°yRuA™-™°) (98) 

or, more compactly, 



Jk{w) = Jfc.min + \\W — W° 



\Ru.k 



where w° denotes the minimizer of Jk(w) and is given by 

w ° = R ujc r duM 

and Jfe jm i n denotes the minimum value of J k {w) when evaluated at w — w°: 



Jk,min = v\k- r du,k R u\ r du,k = Jk(w°) 



(99) 



(100) 



(101) 
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Observe that this value is necessarily non-negative since it can be viewed as the Schur complement of the 
following covariance matrix: 



E 



ut „• 



[ d k (i) Uk,i ] ) = 



a d,k r du,k 
^du,k Ru,k 



(102) 



and covariance matrices are nonnegative-dcfinitc. 

The choice of the quadratic form (93) or (97) for J k {w) is useful for many applications, as was already 
illustrated in the previous section for examples involving AR modeling, MA modeling, localization, and 
spectral sensing. Other choices for Jk{w) are of course possible and these choices can even be different for 
different nodes. It is sufficient in this article to illustrate the main concepts underlying diffusion adaptation 
by focusing on the useful case of MSE cost functions of the form (97); still, most of the derivations and 
arguments in the coming sections can be extended beyond MSE optimization to more general cost functions 
— as already shown in [1-3]; see also Sec. 10.4. 

The positive-definiteness of the covariance matrices {R Uy k} ensures that each Jk{w) in (97) is strictly 
convex, as well as J glob (w) from (39). Moreover, all these cost functions have a unique minimum at the 
same w°, which satisfies the normal equations: 



Ru,k W 



'du.k i 



for every k = 1 ; 2, . . . , N 



(103) 



Therefore, given knowledge of {rd u ,k, Ru,k}, each node can determine w° on its own by solving (103). One 
then wonders about the need to seek distributed cooperative and adaptive solutions. There are a couple of 
reasons: 

(a) First, even for MSE cost functions, it is often the case that the required moments {rd u ,k, Ru,k} are not 
known beforehand. In this case, the optimal w° cannot be determined from the solution of the normal 
equations (103). The alternative methods that we shall describe will lead to adaptive techniques that 
enable each node k to estimate w° directly from data realizations. 

(b) Second, since adaptive strategies rely on instantaneous data, these strategies possess powerful tracking 
abilities. Even when the moments vary with time due to non-stationary behavior (such as w° changing 
with time), these changes will be reflected in the observed data and will in turn influence the behavior 
of the adaptive construction. This is one of the key advantages of adaptive strategies: they enable 
learning and tracking in real-time. 

(c) Third, cooperation among nodes is generally beneficial. When nodes act individually, their performance 
is limited by the noise power level at their location. In this way, some nodes can perform significantly 
better than other nodes. On the other hand, when nodes cooperate with their neighbors and share 
information during the adaptation process, we will see that performance can be improved across the 
network. 



3.1 Relating the Global Cost to Neighborhood Costs 

Let us therefore consider the optimization of the following global cost function: 



N 



J sloh H = E^m 



k=l 



(104) 



where J k (w) is given by (93) or (97). Our strategy to optimize J gloh (w) in a distributed manner is based 
on two steps, following the developments in [1,2,18]. First, using a complction-of-squarcs argument (or, 
equivalcntly, a second-order Taylor series expansion), we approximate the global cost function (104) by 
an alternative local cost that is amenable to distributed optimization. Then, each node will optimize the 
alternative cost via a steepest-descent method. 
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To motivate the distributed diffusion-based approach, we start by introducing a set of nonnegative coef- 
ficients {eke} that satisfy two conditions: 



for k = 1,2 N 



N 



cm > 0, ^ cm = 1, and cm = if £ £ A4 (105) 

i=i 

where A/fc denotes the neighborhood of node k. Condition (105) means that for every node k, the sum of the 
coefficients {eke} that relate it to its neighbors is one. The coefficients {cki] are free parameters that are 
chosen by the designer; obviously, as shown later in Theorem 6.8, their selection will have a bearing on the 
performance of the resulting algorithms. If we collect the entries {cm} into an N x N matrix C, so that the 
k— th row of C is formed of {cm, I = 1,2,..., N}, then condition (105) translates into saying that each of 
row of C adds up to one, i.e., 



CI = 1 



(106) 



where the notation 1 denotes an TV x 1 column vector with all its entries equal to one: 



1 = col{l,l,...,l} (107) 

We say that C is a right stochastic matrix. Using the coefficients {cki} so defined, we associate with each 
node t, a local cost function of the following form: 

J]°» ± c ki Jk(w) (108) 

This cost consists of a weighted combination of the individual costs of the neighbors of node t (including I 
itself) — see Fig. 9. Since the {cm} are all nonnegative and each Jk(w) is strictly convex, then j] oc (w) is 
also strictly convex and its minimizer occurs at the same w = w° . Using the alternative representation (99) 
for the individual Jk(w), we can re-express the local cost J] oc (w) as 

kGAfe keAfe 



or, equivalcntly, 



j]rw = Jl%n + \\™-™°\\'k 



(110) 



where J]°^ in corresponds to the minimum value of Jj, oc (w) at the minimizer w = w°: 

^]°min = c kl Jk.min (HI) 



and Ri is a positive-definite weighting matrix defined by: 

Rt = Y CkiR ^ ( 112 ) 

k£Af e 

That is, Ri is a weighted combination of the covariance matrices in the neighborhood of node i. Equality 
(110) amounts to a (second-order) Taylor series expansion of j] oc (w) around w = w° . Note that the right- 
hand side consists of two terms: the minimum cost and a weighted quadratic term in the difference (w — w°). 
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Cl3 + C 2 3 + C33 + C53 = 1 
4° C M = Cl 3 Jl(w) + C 23 J 2 (w) + C 33 J 3 (u>) + C 5 3J 5 (w) 



Figure 9: A network with TV = 10 nodes. The nodes in the neighborhood of node 3 are highlighted with their 
individual cost functions, and with the combination weights {C13, C23, C53} along the connecting edges; there is also a 
combination weight associated with node 3 and is denoted by C33. The expression for the local cost function, J 3 oc (w), 
is also shown in the figure. 



Now note that we can express J glob (u>) from (104) as follows 

Jglob H (105) 



(108) 



N / N \ 

£ E^Um 

fe=l \£=1 ) 
N / N \ 

£ £c fc <? J k{w) 
e=i \k=i ) 

N 

E J i oc H 



e=i 



N 



Ji oc n + E J ^ oc ( 



Substituting (110) into the second term on the right-hand side of the above expression gives: 



js>° b H = 4°» + £ n™-™iL + E j a 



loc 
min 



(113) 



(114) 



The last term in the above expression does not depend on w. Therefore, minimizing J sloh (w) over w is 
equivalent to minimizing the following alternative global cost: 



(115) 
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Expression (115) relates the optimization of the original global cost function, J glob (u>) or its equivalent 
jgiob £ Q ^ ne n ewly- introduced local cost function J k ° c (w). The relation is through the second term on 
the right-hand side of (115), which corresponds to a sum of quadratic factors involving the minimizer u>°; 
this term tells us how the local cost J k oc (uu) can be corrected to the global cost Js loh (w). Obviously, the 
minimizer w° that appears in the correction term is not known since the nodes wish to determine its value. 
Likewise, not all the weighting matrices Kg are available to node k; only those matrices that originate from 
its neighbors can be assumed to be available. Still, expression (115) suggests a useful way to replace J k ° c 
by another local cost that is closer to Js lob (w). This alternative cost will be shown to lead to a powerful 
distributed solution to optimize J sloh (w) through localized interactions. 

Our first step is to limit the summation on the right-hand side of (115) to the neighbors of node k (since 
every node k can only have access to information from its neighbors). We thus introduce the modified cost 
function at node k: 

Jt h 'H = Ji oc H+ E W w -™°\\lt ( 116 ) 

The cost functions J k oc (w) and J k ° h (w) are both associated with node k; the difference between them is 
that the expression for the latter is closer to the global cost function (115) that we want to optimize. 

The weighting matrices {Re} that appear in (116) may or may not be available because the second- 
order moments {R u ,e} may or may not be known beforehand. If these moments are known, then we can 
proceed with the analysis by assuming knowledge of the {Re}- However, the more interesting case is when 
these moments are not known. This is generally the case in practice, especially in the context of adaptive 
solutions and problems involving non-stationary data. Often, nodes can only observe realizations {ut,i} of 
the regression data {ue,i} arising from distributions whose covariance matrices are the unknown {R u .e}- 
One way to address the difficulty is to replace each of the weighted norms \\w — in (116) by a scaled 

multiple of the un-weighted norm, say, 

\\w-w°\\ 2 Re « b a -\\w-w°\\ 3 (117) 

where bek is some nonnegative coefficient; we are even allowing its value to change with the node index k. 
The above substitution amounts to having each node k approximate the {Re} from its neighbors by multiples 
of the identity matrix 

Ri « b lk I M (118) 

Approximation (117) is reasonable in view of the fact that all vector norms are equivalent [19-21]; this norm 
property ensures that we can bound the weighted norm \\w — w°\\ R by some constants multiplying the 
un-weighted norm \\w — u>°|| 2 , say, as: 

rj||u;-w || 2 < \\w-w°\\ 2 Re < r 2 \\w - w°\\ 2 (119) 

for some positive constants (ri,r2). Using the fact that the {Re} are Hcrmitian positive-definite matrices, 
and calling upon the Raylcigh-Ritz characterization of eigenvalues [19,20], we can be more specific and 
replace the above inequalities by 

X m in(Re)-\\w-w°\\ 2 < \\w-w°\\ 2 Re < X m ,ARt)-\\w-w°\\ 2 (120) 

We note that approximations similar to (118) are common in stochastic approximation theory and they 
mark the difference between using a Newton's iterative method or a stochastic gradient method [5,22]; the 
former uses Hessian matrices as approximations for Re and the latter uses multiples of the identity matrix. 
Furthermore, as the derivation will reveal, we do not need to worry at this stage about how to select the 
scalars {bek}', they will end up being embedded into another set of coefficients {aek} that will be set by the 
designer or adjusted by the algorithm — see (132) further ahead. 
Thus, we replace (116) by 

jt oh "{w) = J k ° c {w) + ]T be k \\w-w°\\ 2 (121) 
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The argument so far has suggested how to modify J^ oc (u>) from (108) and replace it by the cost (121) that is 
closer in form to the global cost function (115). If we replace Jj? c (w) by its definition (108), we can rewrite 
(121) as 

(122) 




With the exception of the variable w°, this approximate cost at node k relies solely on information that is 
available to node k from its neighborhood. We will soon explain how to handle the fact that w° is not known 
beforehand to node k. 



3.2 Steepest-Descent Iterations 

Node k can apply a steepest-descent iteration to minimize </f lob (id). Let Wk,i denote the estimate for the 
minimizcr w° that is evaluated by node k at time i. Starting from an initial condition _ l5 node k can 
compute successive estimates iteratively as follows: 



Wk,i = Wk,i-1 - Mfc 



J jf ,b "K, i _i) , *>0 (123) 



where fik is a small positive step-size parameter, and the notation V w J(a) denotes the gradient vector of the 
function J(w) relative to w and evaluated at w = a. The step-size parameter fj,k can be selected to vary with 
time as well. One choice that is common in the optimization literature [5,22,52] is to replace /j,k in (123) by 
step-size sequences {fj.(i) > 0} that satisfy the two conditions (25). However, such step-size sequences are 
not suitable for applications that require continuous learning because they turn off adaptation as i — > oo; 
the steepest-descent iteration (123) would stop updating since Hk{i) would be tending towards zero. For this 
reason, we shall focus mainly on the constant step-size case described by (123) since we are interested in 
developing distributed algorithms that will endow networks with continuous adaptation abilities. 
Returning to (123) and computing the gradient vector of (122) we get: 

w k ,i = w k ,i-i - jUfc Qfc [V™ Je(w k ,i-i)]* - Hk J! bik (wk,i-i - w°) (124) 
*£A/i ££AT k \{k} 

Using the expression for Ji[w) from (97) we arrive at 

Wk,i = Wks-i + [ik y~] Cjk (r-rf M ,i - Ru t t Wk,i-i) + Hk ^2 hk (w° — Wk,i-i) (125) 
ieU k e^M k \{k} 

This iteration indicates that the update from Wk,i—i to Wk,\ involves adding two correction terms to Wk,i—i- 
Among many other forms, we can implement the update in two successive steps by adding one correction 
term at a time, say, as follows: 

i>k,% = Wfe,i-i + V-k ^2 ctk {r du ,i ~ Ru,i Wk,i-i) (126) 



w k ,. 



ipk.i + Mfc X] hk {w° - Wk,i-x) (127) 
iaM k \{k} 



Step (126) updates Wk^-i to an intermediate value tpk.i by using local gradient vectors from the neighborhood 
of node k. Step (127) further updates tp-^i to w^i- However, this second step is not realizable since w° is 
not known and the nodes are actually trying to estimate it. Two issues stand out from examining (127): 

(a) First, iteration (127) requires knowledge of the minimizer w°. Neither node k nor its neighbors know 
the value of the minimizer; each of these nodes is actually performing steps similar to (126) and (127) 
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to estimate the minimizer. However, each node t has a readily available approximation for w°, which 
is its local intermediate estimate ijjt,i- Therefore, we replace w° in (127) by ipg^. This step helps diffuse 
information throughout the network. This is because each neighbor of node k determines its estimate 
ij)£ t i by processing information from its own neighbors, which process information from their neighbors, 
and so forth. 

(b) Second, the intermediate value tpk,i at node k is generally a better estimate for w° than Wk,i-i since it 
is obtained by incorporating information from the neighbors through the hrst step (126). Therefore, 
we further replace Wk,i-i in (127) by ipk,i- This step is reminiscent of incremental-type approaches to 
optimization, which have been widely studied in the literature [23-26] . 



With the substitutions described in items (a) and (b) above, we replace the second step (127) by 

eeAf k \{k} 

1 - Mfc htk ^ fe >* + ^ k X! bek 



Introduce the weighting coefficients: 



a-kk = 1 — A*fe ^2 ^ ek 

eeJV k \{k} 

aik = Hkhk, £ G A4\{fc} 



aek 



0. 



(128) 



(129) 

(130) 
(131) 



and observe that, for sufficiently small step-sizes /j,k, these coefficients are nonnegative and, moreover, they 
satisfy the conditions: 



for k = 1,2,..., N : 
atk > 0, 2J a £k = 1, and a ik = if I g A4 



JV 



(132) 



Condition (132) means that for every node k, the sum of the coefficients {aik} that relate it to its neighbors 
is one. Just like the {q^}, from now on, we will treat the coefficients {aik} as free weighting parameters that 
are chosen by the designer according to (132); their selection will also have a bearing on the performance of 
the resulting algorithms — see Theorem 6.8. If we collect the entries {aik} into an JV x JV matrix A, such 
that the k— th column of A consists of {aik, £ = 1,2,..., TV}, then condition (132) translates into saying 
that each column of A adds up to one: 



A 1 1 = 1 



(133) 



We say that A is a left stochastic matrix. 



3.3 Adapt-then-Combine (ATC) Diffusion Strategy 

Using the coefficients {aik} so defined, we replace (126) and (128) by the following recursions for i > 0: 



(ATC strategy) 




(134) 
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for some nonnegative coefBcients {cik,aik} that satisfy conditions (106) and (133), namely, 



Cl = l, A 1 l = t 



(135) 



or, equivalcntly, 



N 



cik > 0, 



atk > 0, 



k=l 
N 



for it = 1,2,... 
, c th = if £ 



,N: 



A4 



(136) 



if £ ^ 



To run algorithm (134), we only need to select the coefficients {aik,cik} that satisfy (135) or (136); there is 
no need to worry about the intermediate coefficients {bik} any longer since they have been blended into the 
{agk}. The scalars {c£k,ci£k} that appear in (134) correspond to weighting coefficients over the edge linking 
node k to its neighbors I G A4 . Note that two sets of coefficients are used to scale the data that are being 
received by node k: one set of coefficients, {c^}, is used in the first step of (134) to scale the moment data 
{rdu.i, Ru.e}, and a second set of coefficients, {a^}, is used in the second step of (134) to scale the estimates 
{ijje,i}- Figure 10 explains what the entries on the columns and rows of the combination matrices {A, C} 
stand for using an example with N — 6 and the matrix C for illustration. When the combination matrix is 
right-stochastic (as is the case with C), each of its rows would add up to one. On the other hand, when the 
matrix is left-stochastic (as is the case with ^4), each of its columns would add up to one. 



coefficients used to scale 
data sent from node 3 



C = 
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coefficients used to scale 
data received by node 3 
from its neighbors. 



Figure 10: Interpretation of the columns and rows of combination matrices. The pair of entries {c^e, c^k} correspond 
to weighting coefficients used over the edge connecting nodes k and I. When nodes (k, £) axe not neighbors, then 
these weights are zero. 



At every time instant i, the ATC strategy (134) performs two steps. The first step is an information 
exchange step where node k receives from its neighbors their moments {R u ,e, r du,e}- Node k combines this 
information and uses it to update its existing estimate Wk : i-i to an intermediate value ipk,i- All other nodes 
in the network are performing a similar step and updating their existing estimates {v}£ j-i} into intermediate 
estimates {il>e,i} by using information from their neighbors. The second step in (134) is an aggregation or 
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consultation step where node k combines the intermediate estimates of its neighbors to obtain its update 
estimate w k ,i- Again, all other nodes in the network are simultaneously performing a similar step. The 
reason for the name Adapt-then-Combine (ATC) strategy is that the first step in (134) will be shown to lead 
to an adaptive step, while the second step in (134) corresponds to a combination step. Hence, strategy (134) 
involves adaptation followed by combination or ATC for short. The reason for the qualification "diffusion" 
is that the combination step in (134) allows information to diffuse through the network in real time. This is 
because each of the estimates tpa is influenced by data beyond the immediate neighborhood of node k. 

In the special case when C = I, so that no information exchange is performed but only the aggregation 
step, the ATC strategy (134) reduces to: 



(ATC strategy without 
information exchange) 



ipk,i = Wk,i-l + Mfe {rdu,k - Ru.k ttffc,i-i) 



(137) 



where the first step relies solely on the information {R u .k, fdu,k} that is available locally at node k. 

Observe in passing that the term that appears in the information exchange step of (134) is related to the 
gradient vectors of the local costs {Je(w)} evaluated at Wki-x, i.e., it holds that 



Tdu,i ~ Ru,£ Wk,i-1 = - [V w Je(w k ,i-l)]* 

so that the ATC strategy (134) can also be written in the following equivalent form: 



(138) 



(ATC strategy) 



ipk,i = Wk,i-i - Mfc 


ctk [V w Je(w k ,i-i)}* 






Wk,i = 2J a £k lpl,i 









(139) 



The significance of this general form is that it is applicable to optimization problems involving more general 
local costs Jt{w) that are not necessarily quadratic in w, as detailed in [1-3] — see also Sec. 10.4. The 
top part of Fig. 11 illustrates the two steps involved in the ATC procedure for a situation where node k 
has three other neighbors labeled {1,2, £}. In the first step, node k evaluates the gradient vectors of its 
neighbors at Wk,i—i, and subsequently aggregates the estimates {ipi,i, ip2,i, ^e,i} from its neighbors. The 
dotted arrows represent flow of information towards node k from its neighbors. The solid arrows represent 
flow of information from node k to its neighbors. The CTA diffusion strategy is discussed next. 



3.4 Combine-then-Adapt (CTA) Diffusion Strategy 

Similarly, if we return to (125) and add the second correction term first, then (126)-(127) are replaced by: 



k,i-l 



Wk,i-i + [ik 2J b lk (w° - w k ,i-i) 
ieM k \{k] 

ipk,i-l + Mfe X! ° lk ( rdu . £ ~ Ru - 1 W k,i-l) 
tert k 



(140) 
(141) 



Following similar reasoning to what we did before in the ATC case, we replace w° in step (140) by wt i-i and 
replace Wkj-i in (141) by ipk,i-l- We then introduce the same coefficients {ai k } and arrive at the following 
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Figure 11: Illustration of the ATC and CTA strategies for a node k with three other neighbors {1, 2, £}. The updates 
involve two steps: information exchange followed by aggregation in ATC and aggregation followed by information 
exchange in CTA. The dotted blue arrows represent the data received from the neighbors of node k, and the solid 
red arrows represent the data sent from node k to its neighbors. 



combine-then-adapt (CTA) strategy: 



(CTA strategy) 





aik Wid-i 






W k ,i = l/>k,i-l 


+ Mfc cik {rdu,i ~ Ru,e ipk,%-i) 







(142) 



where the nonnegative coefficients {cek,o,e k } satisfy the same conditions (106) and (133), namely, 



Cl = l, A 1 l = t 



(143) 



or, equivalently, 



for k = 1,2,..., N : 

C£k > 0, c « = 1 > c «> = if I i A4 



N 



k=l 
N 



(144) 



0(k 



>0, ^ott = l, a lk = 0tiliNk 



At every time instant i, the CTA strategy (142) also consists of two steps. The first step is an aggregation 
step where node k combines the existing estimates of its neighbors to obtain the intermediate estimate V'fe.i-i ■ 
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All other nodes in the network are simultaneously performing a similar step and aggregating the estimates 
of their neighbors. The second step in (142) is an information exchange step where node k receives from 
its neighbors their moments {Rdu,i , fdu,e} and uses this information to update its intermediate estimate to 
Wh,i- Again, all other nodes in the network are simultaneously performing a similar information exchange 
step. The reason for the name Combine-then- Adapt (CTA) strategy is that the first step in (142) involves 
a combination step, while the second step will be shown to lead to an adaptive step. Hence, strategy (142) 
involves combination followed by adaptation or CTA for short. The reason for the qualification "diffusion" 
is that the combination step of (142) allows information to diffuse through the network in real time. 

In the special case when C — I, so that no information exchange is performed but only the aggregation 
step, the CTA strategy (142) reduces to: 



(CTA strategy without 
information exchange) 



ipk,i-i = ^2 atk we,i-i 

Wk,i = i>k,i-l + Mfe ( r du,k ~ Ru,k i>k,i-l) 



(145) 



where the second step relies solely on the information {Ru,k, fdu,k} that is available locally at node k. Again, 
the CTA strategy (142) can be rewritten in terms of the gradient vectors of the local costs { Jg(w)} as follows: 



(CTA strategy) 





aik wi.i-i 






Wk,i = 1pk,i-l 


- Hk 2J c ik [V w J J j(V'fc,»-i)]* 







(146) 



The bottom part of Fig. 11 illustrates the two steps involved in the CTA procedure for a situation where 
node k has three other neighbors labeled {1,2,£}. In the first step, node k aggregates the estimates 
{wi t i—i,W2 i—i,wgi—x} from its neighbors, and subsequently performs information exchange by evaluating 
the gradient vectors of its neighbors at tpk,i-i- 



3.5 Useful Properties of Diffusion Strategies 

Note that the structure of the ATC and CTA diffusion strategies (134) and (142) are fundamentally the same: 
the difference between the implementations lies in what variable we choose to correspond to the updated 
weight estimate w^ i. In the ATC case, we choose the result of the combination step to be w^i, whereas in 
the CTA case we choose the result of the adaptation step to be Wk %• 

For ease of reference, Table 2 lists the steepest-descent diffusion algorithms derived in the previous 
sections. The derivation of the ATC and CTA strategies (134) and (142) followed the approach proposed in 
[18,27]. CTA estimation schemes were first proposed in the works [35-39], and later extended in [18,27,32,33]. 
The earlier versions of CTA in [35-37] used the choice C = I. This form of the algorithm with C = I, and 
with the additional constraint that the step-sizes /j,k should be time-dependent and decay towards zero as 
time progresses, was later applied by [40,41] to solve distributed optimization problems that require all nodes 
to reach consensus or agreement. Likewise, special cases of the ATC estimation scheme (134), involving an 
information exchange step followed by an aggregation step, first appeared in the work [28] on diffusion least- 
squares schemes and subsequently in the works [18, 29-33] on distributed mean-square-error and state-space 
estimation methods. A special case of the ATC strategy (134) corresponding to the choice C = I with 
decaying step-sizes was adopted in [34] to ensure convergence towards a consensus state. Diffusion strategies 
of the form (134) and (142) (or, equivalently, (139) and (146)) are general in several respects: 

(1) These strategies do not only diffuse the local weight estimates, but they can also diffuse the local 
gradient vectors. In other words, two sets of combination coefficients {agk,C£k} are used. 
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Table 2: Summary of steepest-descent diffusion strategies for the distributed optimization of general problems of the 
form (92), and their specialization to the case of mean-square-error (MSE) individual cost functions given by (93). 



Algorithm 


Recursions 


Reference 


ATC strategy 

(general case) 


1pk,i = Wk,i-1 — Hk ^2 Clk [Vm J^(tOfc.t-l)]* 
t&M k 

Wk,i = 2J aik ipe,i 
iaM k 


(139) 


ATC strategy 

(MSE costs) 


ipk,i = Wk,i-i + £tfc ^ cik (rdu,e — Ru,i Wk,i-i) 

U)k,i = ^2 a-ik ipt,i 
l&M k 


(134) 


CTA strategy 

(general case) 


eeM k 

w k ,i = tpk,i-i - Mfc Ctk [^■MV'M-i)]* 
t&M k 


(146) 


CTA strategy 

(MSE costs) 


tpk,i-i = ^ aik wi,i-i 
eeM k 

w k ,i = ipk,i-i + Mfc Clk ( r du,l - Ru.e ^k,i-l) 
eeAT k 


(142) 



(2) In the derivation that led to the diffusion strategies, the combination matrices C and A are only 
required to be right-stochastic (for C) and left-stochastic (for A). In comparison, it is common in 
consensus-type strategies to require the corresponding combination matrix A to be doubly stochastic 
(i.e., its rows and columns should add up to one) — see, e.g., App. E and [40,42-44]. 

(3) As the analysis in Sec. 6 will reveal, ATC and CTA strategies do not force nodes to converge to an 
agreement about the desired parameter vector w°, as is common in consensus- type strategies (see 
App. E and [40,45-51]). Forcing nodes to reach agreement on w° ends up limiting the adaptation 
and learning abilities of these nodes, as well as their ability to react to information in real-time. 
Nodes in diffusion networks enjoy more flexibility in the learning process, which allows their individual 
estimates, {wfc.i}, to tend to values that lie within a reasonable mean-square-deviation (MSD) level 
from the optimal solution, w°. Multi-agent systems in nature behave in this manner; they do not 
require exact agreement among their agents (see, e.g., [8-10]). 

(4) The step-size parameters are not required to depend on the time index i and are not required 
to vanish as i — > 00 (as is common in many works on distributed optimization, e.g., [22,40,52,53]). 
Instead, the step-sizes can assume constant values, which is a critical property to endow networks 
with continuous adaptation and learning abilities. An important contribution in the study of diffusion 
strategies is to show that distributed optimization is still possible even for constant step-sizes, in 
addition to the ability to perform adaptation, learning, and tracking. Sections 5 and 6 highlight the 
convergence properties of the diffusion strategies — see also [1-3] for results pertaining to more general 
cost functions. 
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(5) Even the combination weights {aek,C£k} can be adapted, as we shall discuss later in Sec. 8.3. In this 
way, diffusion strategies allow multiple layers of adaptation: the nodes perform adaptive processing, 
the combination weights can be adapted, and even the topology can be adapted especially for mobile 
networks [81. 



4 Adaptive Diffusion Strategies 

The distributed ATC and CTA steepest-descent strategies (134) and (142) for determining the w° that solves 
(92)-(93) require knowledge of the statistical information {R u ,k, fdu.k}- These moments are needed in order 
to be able to evaluate the gradient vectors that appear in (134) and (142), namely, the terms: 



- [V™ Je(tpk,i-i)]* 



{rdu,e - Ru,i wk,i-i) 
(rdu,e - Ru,i V'fc.i-i) 



(147) 
(148) 



for all I £ A/fe- However, the moments {R u ,e, rdu,e} are often not available beforehand, which means that the 
true gradient vectors are generally not available. Instead, the agents have access to observations {dk(i), 
of the random processes {dk(i), There are many ways by which the true gradient vectors can be 

approximated by using these observations. Recall that, by definition, 



A 



Ru,t = Eu^u^j, r du j = Ed e (i)u 



t.i 



(149) 



One common stochastic approximation method is to drop the expectation operator from the definitions of 
{Ru,£, rd u ,e } and to use the following instantaneous approximations instead [4-7]: 



Ru,£ ~ u^u^i, rdu,£~de(i)u 
In this case, the approximate gradient vectors become: 

{r du j ~ Ruj Wk,i-i) ~ Ugj [dt{i) -ui ti w k ,i-x] 



(rduj — Ru,i "0/M-i) 



[di(i) - U£ ti ip. 



k,i-l\ 



(150) 

(151) 
(152) 



Substituting into the ATC and CTA steepest-descent strategies (134) and (142), we arrive at the following 
adaptive implementations of the diffusion strategies for i > 0: 



(adaptive ATC strategy) 



4>k,i 


= Wk,i-i + Vk 


c£k u* Li [dt(i) - ui.iw k ,i-i\ 






ieM k 


Wk,i 


= ^2 atk 











(153) 



and 



(adaptive CTA strategy) 



i>k,%-\ = 








w k ,i = ipk,i-l 






i^M k 



(154) 



where the coefficients {a£k,cek} are chosen to satisfy: 



forfc = l,2,...,AT: 
cik. > 0, ^2 c » = 1, cm = if I $. Afk 



N 



k=l 
N 



(155) 



a ik 



> 0, °« = !' °« = if £ & ^ 



i=i 
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The adaptive implementations usually start from the initial conditions we.-i = for all t, or from some 
other convenient initial values. Clearly, in view of the approximations (151)— (152), the successive iterates 
{itffcji, ipk,ii ipk,i—i} that are generated by the above adaptive implementations are different from the iterates 
that result from the steepest-descent implementations (134) and (142). Nevertheless, we shall continue 
to use the same notation for these variables for ease of reference. One key advantage of the adaptive 
implementations (153)-(154) is that they enable the agents to react to changes in the underlying statistical 
information {rd u ,e, Ru,e} and to changes in w°. This is because these changes end up being reflected in the 
data realizations {dk(i), Uk,i}- Therefore, adaptive implementations have an innate tracking and learning 
ability that is of paramount significance in practice. 

We say that the stochastic gradient approximations (151)-(152) introduce gradient noise into each step 
of the recursive updates (153)-(154). This is because the updates (153)-(154) can be interpreted as corre- 
sponding to the following forms: 



(adaptive ATC strategy) 



ipk,i = W k ,i- 


v— » " * 

-1—Hk 2.^ C£k [V w Je(Wk,i-l)] 















(156) 



and 



(adaptive CTA strategy) 



tpk,i-i = X! afk w e,i-i 

e&M k 



(157) 



where the true gradient vectors, {V w Je(-)}, have been replaced by approximations, {V^Jf (■)} — compare 
with (139) and (146). The significance of the alternative forms (156)-(157) is that they are applicable to 
optimization problems involving more general local costs Jg(w) that are not necessarily quadratic, as detailed 
in [2,3]; see also Sec. 10.4. In the next section, we examine how gradient noise affects the performance of 
the diffusion strategies and how close the successive estimates {u>k,i\ get to the desired optimal solution w°. 
Table 3 lists several of the adaptive diffusion algorithms derived in this section. 

The operation of the adaptive diffusion strategies is similar to the operation of the steepest-descent 
diffusion strategies of the previous section. Thus, note that at every time instant i, the ATC strategy (153) 
performs two steps; as illustrated in Fig. 12. The first step is an information exchange step where node 
k receives from its neighbors their information {dt(i),uii}. Node k combines this information and uses it 
to update its existing estimate Wk.i-i to an intermediate value ipk,i- All other nodes in the network are 
performing a similar step and updating their existing estimates {wt t i-i} into intermediate estimates {ipe,i} 
by using information from their neighbors. The second step in (153) is an aggregation or consultation step 
where node k combines the intermediate estimates {ipe,i} of its neighbors to obtain its update estimate Wf-j. 
Again, all other nodes in the network are simultaneously performing a similar step. In the special case when 
C = J, so that no information exchange is performed but only the aggregation step, the ATC strategy (153) 
reduces to: 



(adaptive ATC strategy 
without information exchange) 




(158) 
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Table 3: Summary of adaptive diffusion strategies for the distributed optimization of general problems of the form 
(92), and their specialization to the case of mean-square-error (MSE) individual cost functions given by (93). These 
adaptive solutions rely on stochastic approximations. 



Algorithm 


Recursions 


Reference 


Adaptive ATC strategy 

(general case) 


4>k,i = Wk,i-i - Mfe Clk l^™Je( w k,i-i)] 
iaN k 

V>h,i = aik ^ >e ' i 


(156) 


Adaptive ATC strategy 

(MSE costs) 


ipk,i = Wh,i-i + Mfc c ik ut,i [de{i) — ut,iWk,i-i] 
iaM k 

w k ,i = ^2 atk i>t,i 


(153) 


Adaptive ATC strategy 

(MSE costs) 

(no information exchange) 


4>k,i = Wkj-i + Hk u*k,i [dk{i) - u k ,iW k ,i-i] 

\ — ^ 

Wk,i = aik 
l^M k 


(158) 


Adaptive CTA strategy 

(general case) 


^k,i-i = ^2 aik U>e t i-i 

l^k 

lesfk 


(157) 


Adaptive CTA strategy 

(MSE costs) 


ij)k,i-i = aik we,i-i 
iaM k 

Wk,i = i>k,i-i + Mfc ^ c tk [di{i) — Ue,i ipk,i-i] 
eeM k 


(154) 


Adaptive CTA strategy 

(MSE costs) 

(no information exchange) 


ipk,i-l = ^ aik wi,i-i 
iaM k 

Wk,i = 1pk,i-l + Mfc u *k,i [dk(i) - Uk,i 


(159) 
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Figure 12: Illustration of the adaptive ATC strategy, which involves two steps: information exchange followed by 
aggregation. 



Likewise, at every time instant i, the CTA strategy (154) also consists of two steps - see Fig. 13. The 
first step is an aggregation step where node k combines the existing estimates of its neighbors to obtain the 
intermediate estimate ipk.i-i- All other nodes in the network are simultaneously performing a similar step 
and aggregating the estimates of their neighbors. The second step in (154) is an information exchange step 
where node k receives from its neighbors their information {dt(i),ugj} and uses this information to update 
its intermediate estimate to Wk,i- Again, all other nodes in the network are simultaneously performing a 
similar information exchange step. In the special case when C = I, so that no information exchange is 
performed but only the aggregation step, the CTA strategy (154) reduces to: 




Figure 13: Illustration of the adaptive CTA strategy, which involves two steps: aggregation followed by information 
exchange. 
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(adaptive CTA strategy 
without information exchange) 



ipk,i-i = X! agk Wi < i ~ 1 

Wk,i = i>k,i-l + Mfe u k,i [ d k(i) - Ufe.i tpk,i-l] 



(159) 



We further note that the adaptive ATC and CTA strategies (153)-(154) reduce to the non-cooperative 
adaptive solution (22)-(23), where each node k runs its own individual LMS filter, when the coefficients 
{a<i;,Cft} are selected as 

a-ik = &ek = cik (non-cooperative case) (160) 
where 8ou denotes the Kroneckcr delta function: 



A 



I = k 

otherwise 



In terms of the combination matrices A and C, this situation corresponds to setting 



A = I N = C 



(non-cooperative case) 



(161) 



(162) 



5 Performance of Steepest-Descent Diffusion Strategies 

Before studying in some detail the mean-square performance of the adaptive diffusion implementations (153)- 
(154), and the influence of gradient noise, we examine first the convergence behavior of the steepest-descent 
diffusion strategies (134) and (142), which employ the true gradient vectors. Doing so, will help introduce 
the necessary notation and highlight some features of the analysis in preparation for the more challenging 
treatment of the adaptive strategies in Sec. 6. 



5.1 General Diffusion Model 



Rather than study the performance of the ATC and CTA steepest-descent strategies (134) and (142) sep- 
arately, it is useful to introduce a more general description that includes the ATC and CTA recursions as 
special cases. Thus, consider a distributed steepest-descent diffusion implementation of the following general 
form for i > 0: 



»fc,i-l — 



Wfc,- 



(f>k,i-l + Mfc ^ C-lh [ r duJ — Ru,. 



»k,i-l 



^ a-2,ek 4>£,-, 



(163) 
(164) 
(165) 



where the scalars \a\^k, c#f~, a% t t.k\ denote three sets of non-negative real coefficients corresponding to the 
(I, k) entries of N x N combination matrices {A\, C, A2}, respectively. These matrices are assumed to satisfy 
the conditions: 



Afl = 1, Ct = 1, All = 1 



(166) 



37 



so that {Ai, A 2 } are left stochastic and C is right-stochastic, i.e., 



for fc = l,2,...,iV: 

cm > 0, ^ Ctt = l, c tt =0if^A4 
fc=i 

N 

ai,£k > 0, ^ ai, tt = 1, ai >tt = if £ ^ A4 ( ' 167 ^ 
e=i 

N 

d2,£k > 0, ^ CL2M = 1, a2 : £fc = if i $ Nk 
i=\ 

Different choices for {Ai,C, A 2 } correspond to different cooperation modes. For example, the choice A\ = 
In and A 2 — A corresponds to the ATC implementation (134), while the choice Ai = A and A 2 = In 
corresponds to the CTA implementation (142). Likewise, the choice C — In corresponds to the case in 
which the nodes only share weight estimates and the distributed diffusion recursions (163)~(165) become 

<f>k,i-i = 51 ai - ik w ^ i ~ 1 ( 168 ) 

lpk,i = <t>k,i-l + (r<iu,k — Ru,k <t>k,i-l) (169) 
Wk,i = ^ a,2,ik ip£,i (170) 

Furthermore, the choice A\ = A 2 = C = In corresponds to the non-cooperative mode of operation, in which 
case the recursions reduce to the classical (stand-alone) steepest-descent recursion [4-7], where each node 
minimizes individually its own quadratic cost J k {w), defined earlier in (97): 

w k ,i = w k ,i-x + Mfc [r du ,k - Ru,k u>k,i-l] , i > (171) 



Table 4: Different choices for the combination matrices {Ai, A2, C} in (163)-(165) correspond to different cooperation 
strategies. 



A 1 


A 2 


C 


Cooperation Mode 


In 


A 


C 


ATC strategy (134). 


In 


A 


In 


ATC strategy (137) without information exchange. 


A 


In 


C 


CTA strategy (142). 


A 


In 


In 


CTA strategy (145) without information exchange. 


In 


In 


In 


non-cooperative steepest-descent (171). 



5.2 Error Recursions 

Our objective is to examine whether, and how fast, the weight estimates {w k ,i} from the distributed imple- 
mentation (163)-(165) converge towards the solution w° of (92)-(93). To do so, we introduce the Mxl 
error vectors: 

<t>k,i (172) 

</>M (173) 

w k ,i (174) 



J. " o 

<Pk,i = W - 

T A 

Vk,i = w - 

A 

Wk,i = w - 
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Each of these error vectors measures the residual relative to the desired minimizer w°. Now recall from (100) 
that 

rdu,k = Ru,k w° (175) 
Then, subtracting w° from both sides of the relations in (163)-(165) we get 



ai t ik wi,i-i 



k.i 



I Im - ^k ^2 Cek J 4>k,i-l 



2J a 2,ek i>e,i 



(176) 

(177) 
(178) 



We can describe these relations more compactly by collecting the information from across the network into 
block vectors and matrices. We collect the error vectors from across all nodes into the following N X 1 block 
vectors, whose individual entries arc of size M x 1 each: 



ih,i 



N. 



W2,i 
W N . 



(179) 



The block quantities {ipi, fa, Wi] represent the state of the errors across the network at time i. Likewise, we 
introduce the following N x N block diagonal matrices, whose individual entries are of size M x M each: 



M = diag{^i/ A/ , fi 2 I M , UnIm} 



K 



diag < en R u .e, ^ c (2 Ru,e, ■ ■ ■ , ^ cm Ru., 



(180) 



(181) 



Each block diagonal entry of 7Z, say, the fc-th entry, contains the combination of the covariance matrices in 
the neighborhood of node k. We can simplify the notation by denoting these neighborhood combinations as 
follows: 



Rk — ^ Cfk Ru,. 

teN k 



so that 1Z becomes 



1Z = diag{ R u R 2 , ...,R N ] 



In the special case when C = In, the matrix 1Z reduces to 



1Z U — diag{Jl U) i, Ru,2, • • • > Ru,n} 



(when G i-T) 



(when C = I) 



(182) 



(183) 



(184) 



with the individual covariance matrices appearing on its diagonal; we denote 7Z by 7Z U in this special case. 
We further introduce the Kronecker products 



Ai = Ai®Im, M = A 2 ®I M 



(185) 
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The matrix A\ is an N x N block matrix whose (£, k) block is equal to ai^fcijvf- Likewise, for A2- In 
other words, the Kronecker transformation defined above simply replaces the matrices {^1,^2} by block 
matrices {Ai, A2} where each entry {aijk, a.2 t ik} hi the original matrices is replaced by the diagonal matrices 
{ai,iklM, a2,£klAi}- For ease of reference, Table 5 lists the various symbols that have been defined so far, 
and others that will be defined in the sequel. 



Tabic 5: Definitions of network variables used throughout the analysis. 



Variable 


Equation 


A\=Ai® Im 
A 2 = A 2 ® I M 
C = C ® I M 


(185) 
(185) 
(245) 


Rk = Clk Ru,l 

eeM k 


(182) 


TZ = diag {Ri, R2, . . . , Rn } 

IZu = diag{7? u ,i, R u ,2, • ■ • , Ru,n} 

R v = diag{(7^,i, o% t 2,---,o%,N] 

M = diag{ fiihi, ^hi, IXnIm } 

S = diag{cr^ t iR u ,i, &l,2Ru,2, ■ ■ ■ ,cI,nRu,n} 


(183) 
(241) 
(319) 
(180) 
(241) 


g = AlMC T 

B = Al {Inm ~ MTZ) AT 

y = gsg T 

T^B T ®B* 


(263) 
(264) 
(280) 
(277) 


Jk = diag{ M ,---, m ,Im,0 m ,- ..,0m} 
T k = diag{ Om,. . . ,0m, R u>k ,0m,- • ■ , Om } 


(294) 
(298) 



Returning to (176)-(178), we conclude that the following relations hold for the block quantities: 

Ajw t -! (186) 
(Inm - MTZ) (187) 
Al & (188) 

so that the network weight error vector, w,;, ends up evolving according to the following dynamics: 



i-l'i 



Wi = A2 (Inm - MTZ) A^Wi-i, i > 



(diffusion strategy) 



(189) 



For comparison purposes, if each node in the network minimizes its own cost function, J k (w), separately 
from the other nodes and uses the non-cooperative steepest-descent strategy (171), then the weight error 
vector across all N nodes would evolve according to the following alternative dynamics: 



Wi = (Inm - MTZ u )w z -i, i>0 



(non-cooperative strategy) 



(190) 



where the matrices A\ and A2 do not appear, and TZ is replaced by TZ U from (184). This recursion is a 
special case of (189) when A\ = A2 = C = Iff. 
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5.3 Convergence Behavior 

Note from (189) that the evolution of the weight error vector involves block vectors and block matrices; this 
will be characteristic of the distributed implementations that we consider in this article. To examine the 
stability and convergence properties of recursions that involve such block quantities, it becomes useful to 
rely on a certain block vector norm. In App. D, we describe a so-called block maximum block and establish 
some of its useful properties. The results of the appendix will be used extensively in our exposition. It is 
therefore advisable for the reader to review the properties stated in the appendix at this stage. 

Using the result of Lemma D.6, we can establish the following useful statement about the convergence of 
the steepest-descent diffusion strategy (163)-(165). The result establishes that all nodes end up converging 
to the optimal solution 10° if the nodes employ positive step-sizes fj,k that are small enough; the lemma 
provides a sufficient bound on the {/ife}. 

Theorem 5.1. ('Convergence to Optimal Solution,) Consider the problem of optimizing the global cost 
(92) with the individual cost functions given by (93). Pick a right stochastic matrix C and left stochastic ma- 
trices A\ and A2 satisfying (166) or (167); these matrices define the network topology and how information is 
shared over neighborhoods. Assume each node in the network runs the (distributed) steepest- descent diffusion 
algorithm (163)-(165). Then, all estimates {wk,i} across the network converge to the optimal solution w° if 
the positive step-size parameters {fJ-k} satisfy 



Amax(^fc) 



(191) 



where the neighborhood covariance matrix Rk is defined by (182). 

Proof. The weight error vector Wi converges to zero if, and only if, the coefficient matrix A2 (Inm — MTV) Ai in 
(189) is a stable matrix (meaning that all its eigenvalues lie strictly inside the unit disc). From property (605) 
established in App. D, we know that A2 (Inm — MTV) Ai is stable if the block diagonal matrix (Inm — MTV) is 
stable. It is now straightforward to verify that condition (191) ensures the stability of (Inm — MTV). It follows that 



it', 



as i — > oc 



(192) 
□ 



Observe that the stability condition (191) does not depend on the specific combination matrices Ai and A2. 
Thus, as long as these matrices are chosen to be left-stochastic, the weight-error vectors will converge to 
zero under condition (191) no matter what {^1,^2} arc. Only the combination matrix C influences the 
condition on the step-size through the neighborhood covariance matrices {Rk}- Observe further that the 
statement of the lemma does not require the network to be connected. Moreover, when C = I, in which 
case the nodes only share weight estimates and do not share the neighborhood moments {rd u ,e, Ru,e}-, as in 
(168)-(170), condition (191) becomes 



tik < 



•^max (Ru,k) 



(cooperation with C = I) 



(193) 



in terms of the actual covariance matrices {R u ,k}- Results (191) and (193) are reminiscent of a classical 
result for stand-alone steepest-descent algorithms, as in the non-cooperative case (171), where it is known 
that the estimate by each individual node in this case will converge to w° if, and only if, its positive step-size 
satisfies 



Hk < 



(Ru,k) 



(non-cooperative case (171) with A x = A 2 = C = In) 



(194) 
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This is the same condition as (193) for the case C = I. 

The following statement provides a bi-directional statement that ensures convergence of the (distributed) 
steepest-descent diffusion strategy (163)-(165) for any choice of left-stochastic combination matrices A\ and 
A 2 . 

Theorem 5.2. (Convergence for Arbitrary Combination Matrices^ Consider the problem of opti- 
mizing the global cost (92) with the individual cost functions given by (93). Pick a right stochastic matrix 
C satisfying (166). Then, the estimates {wk,i} generated by (163)-(165) converge to w° , for all choices of 
left- stochastic matrices A\ and A^ satisfying (166) if, and only if, 



Amax(^fc) 



(195) 



Proof. The result follows from property (b) of Corollary D.l, which is established in App. D. 

□ 

More importantly, we can verify that under fairly general conditions, employing the steepest-descent 
diffusion strategy (163)~(165) enhances the convergence rate of the error vector towards zero relative to the 
non-cooperative strategy (171). The next three results establish this fact when C is a doubly stochastic 
matrix, i.e., it has non- negative entries and satisfies 

CI = 1, C T t = 1 (196) 

with both its rows and columns adding up to one. Compared to the earlier right-stochastic condition on C 
in (105), we are now requiring 

J2 c M = i, J2 ctk = 1 ( 19? ) 

For example, these conditions are satisfied when C is right stochastic and symmetric. They are also satisfied 
for C = I, when only weight estimates are shared as in (168)-(170); this latter case covers the ATC and 
CTA diffusion strategies (137) and (145), which do not involve information exchange. 

Theorem 5.3. (Convergence Rate is Enhanced: Uniform Step-Sizes,) Consider the problem of opti- 
mizing the global cost (92) with the individual cost functions given by (93). Pick a doubly stochastic matrix C 
satisfying (196) and left stochastic matrices A\ and A^ satisfying (166). Consider two modes of operation. In 
one mode, each node in the network runs the (distributed) steepest- descent diffusion algorithm (163)-(165). 
In the second mode, each node operates individually and runs the non-cooperative steepest- descent algorithm 
(171). In both cases, the positive step-sizes used by all nodes are assumed to be the same, say, [ik = t 1 for 
all k, and the value of fi is chosen to satisfy the reguired stability conditions (191) and (194), which are met 
by selecting 

2 



M < min \ — — \ (198) 

l<k<N [\ max (R^ k ) J 

It then holds that the magnitude of the error vector, \\wi\\, in the diffusion case decays to zero more rapidly 
than in the non-cooperative case. In other words, diffusion cooperation enhances convergence rate. 

Proof. Let us first establish that any positive step-size u satisfying (198) will satisfy both stability conditions (191) 
and (194). It is obvious that (194) is satisfied. We verify that (191) is also satisfied when C is doubly stochastic. In 
this case, each neighborhood covariance matrix, Rk , becomes a convex combination of individual covariance matrices 
{R u ,e}, i.e., 

Rk ~ C ikRu,l 

where now 



ctk = 1 (when C is doubly stochastic) 
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To proceed, we recall that the spectral norm (maximum singular value) of any matrix X is a convex function of 
X [56]. Moreover, for Hermitian matrices X, their spectral norms coincide with their spectral radii (largest eigenvalue 
magnitude). Then, Jensen's inequality [56] states that for any convex function /(•) it holds that 



/ I y ] dmX m j < y y 9 rn f(X„ 

\ m / m 



for Hermitian matrices X m and nonnegative scalars 6 m that satisfy 

^6 m = 1 

m 

Choosing /(•) as the spectral radius function, and applying it to the definition of Rk above, we get 



p(Rk) = P ^ ClkRr, 



max p(Ru,, 

i<e<N 



In other words, 



It then follows from (198) that 



max p(R u ,, 

KKN 



A max (i?fc) < max {A m ax(-Ru,fc)i 
Kk<N 



P < 



for all k = 1,2, . . . ,N 



Amax (Rk ) 

so that (191) is satisfied as well. 

Let us now examine the convergence rate. To begin with, we note that the matrix (Inm ~ M1Z) that appears in 
the weight-error recursion (189) is block diagonal: 

(Inm - MTZ) = diag{(/ A / - pRi), (hi - pRi), ■■ ■ , (hi - pRn)} 

and each individual block entry, (Im — pRk), is a stable matrix since p satisfies (191). Moreover, each of these entries 
can be written as 

hi - pR k = ^2 c tk (hi - pR u ,e) 

which expresses (hi — pRk) as a convex combination of stable terms (Im — pRu,e)- Applying Jensen's inequality 
again we get 

p ( ^2 cik(hi — pR u ,e) < ^ cek p(hi - pRu,e) 

Now, we know from (189) that the rate of decay of Wi to zero in the diffusion case is determined by the spectral 
radius of the coefficient matrix A% (Inm — MTZ) Ai . Likewise, we know from (190) that the rate of decay of Wi to 
zero in the non-cooperative case is determined by the spectral radius of the coefficient matrix (Inm — M1Z U ). Then, 
note that 
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r t t\ (605) 

p[Ai Unm — MTZ) At < p(Inm-MTZ) 



max p (I M - pRk) 

Kk<N 



i<i a <N P c ^{hi - pRu,i) 

max c ek p(hi — pR u e) 

Kk<N ^-^ 

c lk ( max p(I M - pR u ,i) I 

^-^ \1<KN I 



max 

Kk<N 
_ _ «6JV, 



max I ( max p(I M ~ pRu,i) ) • Co, 

Kk<N I \1<£<JV / ^— ' 



= max I max p(I M — pRu,e) 

l<k<N \l<e<N 

= max p(I M - pRu,i) 

l<i<N 

= p(Inm - MTZu) 

Therefore, the spectral radius of A% {Inm — MTZ) A\ is at most as large as the largest individual spectral radius in 
the non-cooperative case. 

□ 

The argument can be modified to handle different step-sizes across the nodes if we assume uniform covariance 
data across the network, as stated below. 

Theorem 5.4. (Convergence Rate is Enhanced: Uniform Covariance Data,) Consider the same 
setting of Theorem 5.3. Assume the covariance data are uniform across all nodes, say, R u ,k = Ru is 
independent of k. Assume further that the nodes in both modes of operation employ steps-sizes pLk that are 
chosen to satisfy the required stability conditions (191) and (194), which in this case are met by: 

^ < 2 m r , k = l,2,...,N (199) 

It then holds that the magnitude of the error vector, \\wi\\, in the diffusion case decays to zero more rapidly 
than in the non-cooperative case. In other words, diffusion enhances convergence rate. 

Proof. Since R u ,t — Ru for all I and C is doubly stochastic, we get R k — R u and Inm — MTZ = Inm — MTZ u . Then, 

(605) 



p [Al {Inm - MTZ) A{ J < p{Inm - MTZ) 

= p{Inm - MTZu] 



□ 



The next statement considers the case of ATC and CTA strategies (137) and (145) without information 
exchange, which correspond to the case C — In- The result establishes that these strategies always enhance 
the convergence rate over the non-cooperative case, without the need to assume uniform step-sizes or uniform 
covariance data. 

Theorem 5.5. (Convergence Rate is Enhanced when C = I) Consider the problem of optimizing the 
global cost (92) with the individual cost functions given by (93). Pick left stochastic matrices A\ and Ai 
satisfying (166) and set C = In- This situation covers the ATC and CTA strategies (137) and (145), which 
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do not involve information exchange. Consider two modes of operation. In one mode, each node in the 
network runs the (distributed) steepest- descent diffusion algorithm (168)-(170). In the second mode, each 
node operates individually and runs the non-cooperative steepest- descent algorithm (171). In both cases, the 
positive step-sizes are chosen to satisfy the required stability conditions (193) and (194), which in this case 
are met by 

Mfe < t T5-^> k = l,2,,..,N (200) 

It then holds that the magnitude of the error vector, \\wi\\, in the diffusion case decays to zero more rapidly 
than in the non-cooperative case. In other words, diffusion cooperation enhances convergence rate. 

Proof. When C = In, we get R k = R u . k and, therefore, 1Z = 1Z U and Inm — M1Z = Inm — M1Z U . Then, 

p [Ai (Inm - MTZ) At J < P (Inm - Mil) 

= p(Inm - MTZu) 

□ 

The results of the previous theorems highlight the following important facts about the role of the combination 
matrices {Ai, A 2 ,C} in the convergence behavior of the diffusion strategy (163)-(165): 

(a) The matrix C influences the stability of the network through its influence on the bound in (191). This 
is because the matrices {Rk} depend on the entries of C. The matrices {Ai,^} do not influence 
network stability. 

(b) The matrices {Ai, A2,C} influence the rate of convergence of the network since they influence the 
spectral radius of the matrix A 2 {Inm — MTZ) Af , which controls the dynamics of the weight error 
vector in (189). 



6 Performance of Adaptive Diffusion Strategies 

We now move on to examine the behavior of the adaptive diffusion implementations (153)-(154), and the 
influence of both gradient noise and measurement noise on convergence and steady-state performance. Due 
to the random nature of the perturbations, it becomes necessary to evaluate the behavior of the algorithms 
on average, using mean-square convergence analysis. For this reason, we shall study the convergence of the 
weight estimates both in the mean and mean-square sense. To do so, we will again consider a general diffusion 
structure that includes the ATC and CTA strategies (153)-(154) as special cases. We shall further resort 
to the boldface notation to refer to the measurements and weight estimates in order to highlight the fact 
that they are now being treated as random variables. In this way, the update equations becomes stochastic 
updates. Thus, consider the following general adaptive diffusion strategy for i > 0: 

<Pk,i-i = X! ai > tk w t,i-i ( 201 ) 

if} k>i = <t> kl i-x + ^k ^2 cek u i* ~ ue '* ( 202 ) 

Wk,i = ^2 a 2 ,£k if>i,i (203) 

As before, the scalars {ffli,<fc, ctk, Q>2,tk} are non-negative real coefficients corresponding to the (l,k) entries 
of N x N combination matrices {Ai,C, A 2 }, respectively. These matrices are assumed to satisfy the same 
conditions (166) or (167). Again, different choices for {A\,C, A 2 } correspond to different cooperation modes. 
For example, the choice A\ = In and A 2 = A corresponds to the adaptive ATC implementation (153), while 
the choice A\ = A and A 2 = In corresponds to the adaptive CTA implementation (154). Likewise, the 



45 



choice C = In corresponds to the case in which the nodes only share weight estimates and the distributed 
diffusion recursions (201)-(203) become 

0k,i-i = ai ' ik w t,i-i ( 204 ) 

ifr k ,i = $ k ,i-i + HkU%,i [Mi) - «fc,i </»fc,i-i] ( 205 ) 
Wk,i = Y a 2 ,ik 4>e,i (206) 

Furthermore, the choice A\ = A 2 = C = In corresponds to the non-cooperative mode of operation, where 
each node runs the classical (stand-alone) least-mean-squares (LMS) filter independently of the other nodes: 
[4-7]: 

w k .i = w k ,i-i + HkUk,i [dk(i) - u k ,i Wk,i-i] , i > (207) 



6.1 Data Model 

When we studied the performance of the steepest-descent diffusion strategy (163)-(165) we exploited result 
(175), which indicated how the moments {rd u ,k, Ru.k} that appeared in the recursions related to the optimal 
solution w°. Likewise, in order to be able to analyze the performance of the adaptive diffusion strategy 
(201)-(203), we need to know how the data {dk(i),Uk,i} across the network relate to w°. Motivated by the 
several examples presented earlier in Sec. 2, we shall assume that the data satisfy a linear model of the form: 



dfe(i) = u k .iw° + v k (i) 



(208) 



2 


A 


a d,k 




Ru,k 


A 






A 


Tdu.k 





where Vk(i) is measurement noise with variance k : 

o\ k t EK(i)f (209) 

and where the stochastic processes {dk(i),Uk,i} are assumed to be jointly wide-sense stationary with mo- 
ments: 

E\d k {i)\ 2 (scalar) (210) 

Eu* kti u k ,i > (M x M) (211) 

Ed k (i)u* kti (M x 1) (212) 

All variables are assumed to be zero-mean. Furthermore, the noise process {v k (i)} is assumed to be tempo- 
rally white and spatially independent, as described earlier by (6), namely, 

\v k (i)v* k (j) = 0, for alH ^ j (temporal whiteness) , , 

I , Ufc(i)'u£j(j) = 0, for all i, j whenever k ^ m (spatial whiteness) 

The noise process Vk{i) is further assumed to be independent of the regression data u m< j for all k, m and 
i,j so that: 

E«fc(i)u^ 3 - = 0, for all k, m, i, j (214) 
We shall also assume that the regression data are temporally white and spatially independent so that: 

Eu^«ij = R u ,k5kAj (215) 

Although we are going to derive performance measures for the network under this independence assumption 
on the regression data, it turns out that the resulting expressions continue to match well with simulation 
results for sufficiently small step-sizes, even when the independence assumption docs not hold (in a manner 
similar to the behavior of stand-alone adaptive filters) [4, 5] . 
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6.2 Performance Measures 



<f>k,i 




W° 


~ <t>k,i 




A 


W° 


-fl>k,i 




A 










w° 





Our objective is to analyze whether, and how fast, the weight estimates {w k .i} from the adaptive diffusion 
implementation (201)-(203) converge towards uu°. To do so, we again introduce the M x 1 weight error 
vectors: 

(216) 
(217) 
(218) 

Each of these error vectors measures the residual relative to the desired w° in (208). We further introduce 
two scalar error measures: 

e k (i) = dk(i) — Uk,iWk,i-i (output error) (219) 
e Q .fe(i) = Uk,iWk,i-i (a-priori error) (220) 

The first error measures how well the term Uk,iWk,%-\ approximates the measured data, dfc(i); in view of 
(208), this error can be interpreted as an estimator for the noise term Vk{i). If node k is able to estimate w° 
well, then e k (i) would get close to Vk(i)- Therefore, under ideal conditions, we would expect the variance of 
e k (i) to tend towards the variance of v k (i). However, as remarked earlier in (31), there is generally an offset 
term for adaptive implementations that is captured by the variance of the a-priori error, e Q }k [i). This second 
error measures how well Ufc,iWfc,i— l approximates the uncorrupted term u ki w°. Using the data model (208), 
we can relate {e k (i), e ak (i)} as 

e*(i) = e atk + v k {i) (221) 

Since the noise component, v k {i), is assumed to be zero-mean and independent of all other random variables, 
we recover (31): 



E\e k (i)\ 2 = E\e a , k {i) 



'v,k 



(222) 



This relation confirms that the variance of the output error, &k{i), is always at least as large as a 2 k and 
away from it by an amount that is equal to the variance of the a-priori error, e a ,fc(i). Accordingly, in order 
to quantify the performance of any particular node in the network, we define the mean-square-error (MSE) 
and excess-mean-square-error (EMSE) for node k as the following steady-state measures: 

MSE fc = limE|e fc (i)| 2 (223) 

i— >oo 

EMSE fc = lim E\e Chk (i)\ 2 (224) 



Then, it holds that 



MSE fc = EMSE fe + a 2 vM 



(225) 



Therefore, the EMSE term quantifies the size of the offset in the MSE performance of each node. We also 
define the mean-square-deviation (MSD) of each node as the steady-state measure: 

MSD fc = limE||53 M || 2 (226) 

i— f oo 

which measures how far w k .i is from w° in the mean-square-error sense. 

Wc indicated earlier in (36)-(37) how the MSD and EMSE of stand-alone LMS filters in the non- 
cooperative case depend on {jJLk, c 2 , R u ,k}- In this section, we examine how cooperation among the nodes 
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influences their performance. Since cooperation couples the operation of the nodes, with data originating 
from one node influencing the behavior of its neighbors and their neighbors, the study of the network per- 
formance requires more effort than in the non-cooperative case. Nevertheless, when all is said and done, 
we will arrive at expressions that approximate well the network performance and reveal some interesting 
conclusions. 



6.3 Error Recursions 

Using the data model (208) and subtracting w° from both sides of the relations in (201)-(203) we get 

0k,i-i = ai ' tk ( 227 ) 



WkA 



(228) 
(229) 



Comparing the second recursion with the corresponding recursion in the steepest-descent case (176)-(178), 
we see that two new effects arise: the effect of gradient noise, which replaces the covariance matrices R u .i 
by the instantaneous approximation ut ,u^j, and the effect of measurement noise, vg{i). 

We again describe the above relations more compactly by collecting the information from across the 
network in block vectors and matrices. We collect the error vectors from across all nodes into the following 
N x 1 block vectors, whose individual entries are of size M x 1 each: 



A 



1> 



N,i 



1 A 



0i,t 

4>2,i 



N,i 



~ A 
Wi = 



Wl,i 
W2,i 

W N ,i 



(230) 



The block quantities 0j, w^} represent the state of the errors across the network at time i. Likewise, 
we introduce the following N x N block diagonal matrices, whose individual entries are of size M x M each: 



M = diag{ (iiI M , ^Im, UnIm } 



Hi = diag< eg uljU^j, ^ 



(231) 
(232) 



Each block diagonal entry of Hi, say, the fc-th entry, contains a combination of rank-one regression terms 
collected from the neighborhood of node k. In this way, the matrix Hi is now stochastic and dependent 
on time, in contrast to the matrix 1Z in the steepest-descent case in (181), which was a constant matrix. 
Nevertheless, it holds that 



n 



(233) 



so that, on average, Hi agrees with 1Z. We can simplify the notation by denoting the neighborhood combi- 
nations as follows: 

Rk,i = X Cek u h u v ( 234 ) 

so that H; becomes 



Hi = diag { R lyi , R 2 ,i, Rna } (when C ^ I) 



(235) 
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Again, compared with the matrix Rk defined in (182), we find that Rk,i is now both stochastic and time- 
dependent. Nevertheless, it again holds that 



ER k . t = R k 



(236) 



In the special case when C = I, the matrix 7^ reduces to 



Tl Uti = diag-tu^iti^, u 2l u 2 , u u* Ni u N;i } (when C = I) 



with 



EH* 



(237) 
(238) 



where 1Z U was defined earlier in (184). 

We further introduce the following N x 1 block column vector, whose entries are of size M x 1 each: 



col{ uljv^i), u* 2i v 2 (i), u* Nt v N (i) } 



(239) 



Obviously, given that the regression data and measurement noise are zero-mean and independent of each 
other, we have 



Es; = 



and the covariance matrix of Si is N x N block diagonal with blocks of size M x M: 



S = IKsiS* = diag{af }1 R u s, a z v 2 R u ^, . ■ . ,(t^ n R u ,n} 



Returning to (227)-(229), we conclude that the following relations hold for the block quantities: 



<t>i-i 

Wi 



{iNM-MKi) - MC Si 



where 



C = C®1 



M 



(240) 
(241) 

(242) 
(243) 
(244) 

(245) 



so that the network weight error vector, to,, ends up evolving according to the following stochastic recursion: 



Wi = A 2 {I N m - MTli) A^Wi-i - A 2 MC T s i7 i > 



(diffusion strategy) (246) 



For comparison purposes, if each node operates individually and uses the non-cooperative LMS recursion 
(207), then the weight error vector across all N nodes would evolve according to the following stochastic 
recursion: 



Wi = (7jvm - M1Z u ,i) w t -i - Msi, i>0 



(non-cooperative strategy) 



(247) 



where the matrices A\ and A2 do not appear, and TLi is replaced by TL u .i from (237). 
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6.4 Convergence in the Mean 

Taking expectations of both sides of (246) we find that: 



Aid 



NAI 



MK)A( -ESji-i, i > 



(diffusion strategy) 



(248) 



where we used the fact that Wi-i and 7^ are independent of each other in view of our earlier assumptions 
on the regression data and noise in Sec. 6.1. Comparing with the error recursion (189) in the steepest- 
descent case, we find that both recursions are identical with Wi replaced by Ewi. Therefore, the convergence 
statements from the steepest-descent case can be extended to the adaptive case to provide conditions on the 
step-size to ensure stability in the mean, i.e., to ensure 



Ew, 



as 



(249) 



When (249) is guaranteed, we would say that the adaptive diffusion solution is asymptotically unbiased. The 
following statements restate the results of Theorems 5.1-5.5 in the context of mean error analysis. 

Theorem 6.1. (Convergence in the Mean,) Consider the problem of optimizing the global cost (92) with 
the individual cost junctions given by (93). Pick a right stochastic matrix C and left stochastic matrices A\ 
and Ai satisfying (166) or (167). Assume each node in the network measures data that satisfy the conditions 
described in Sec. 6.1, and runs the adaptive diffusion algorithm (201)-(203). Then, all estimators {wk^i} 
across the network converge in the mean to the optimal solution w° if the positive step-size parameters {^k} 
satisfy 



tik < 



Amax(fit) 



where the neighborhood covariance matrix Rk is defined by (182). In other words, Eid^ 
k as i —> oo. 



(250) 



to" for all nodes 



□ 

Observe again that the mean stability condition (250) does not depend on the specific combination matrices 
A\ and A^ that are being used. Only the combination matrix C influences the condition on the step-size 
through the neighborhood covariance matrices {Rk}- Observe further that the statement of the lemma docs 
not require the network to be connected. Moreover, when C = In, in which case the nodes only share weight 
estimators and do not share neighborhood data {dp{i),ui^} as in (204)-(206), condition (250) becomes 



A«fc < 



Amax(-R«,fc) 



(adaptive cooperation with C = In) 



(251) 



Results (250) and (251) are reminiscent of a classical result for the stand-alone LMS algorithm, as in the 
non-cooperative case (207), where it is known that the estimator by each individual node in this case would 
converge in the mean to w° if, and only if, its step-size satisfies 



Amax(-Rti.fc) 



(non-cooperative adaptation) 



(252) 



The following statement provides a bi-directional result that ensures the mean convergence of the adaptive 
diffusion strategy for any choice of left-stochastic combination matrices A\ and A2 . 
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Theorem 6.2. (Mean Convergence for Arbitrary Combination Matrices,) Consider the problem of 
optimizing the global cost (92) with the individual cost functions given by (93). Pick a right stochastic matrix 
C satisfying (166). Assume each node in the network measures data that satisfy the conditions described in 
Sec. 6.1. Then, the estimators {wk,i} generated by the adaptive diffusion strategy (201)-(203), converge in 
the mean to w° , for all choices of left stochastic matrices A\ and Ai satisfying (166) if, and only if, 



Amax(^fc) 



(253) 



□ 

As was the case with steepest-descent diffusion strategies, the adaptive diffusion strategy (201)-(203) also 
enhances the convergence rate of the mean of the error vector towards zero relative to the non-cooperative 
strategy (207). The next results restate Theorems 5.3-5.5; they assume C is a doubly stochastic matrix. 

Theorem 6.3. (Mean Convergence Rate is Enhanced: Uniform Step-Sizes J Consider the problem 
of optimizing the global cost (92) with the individual cost functions given by (93). Pick a doubly stochastic 
matrix C satisfying (196) and left stochastic matrices A\ and A-i satisfying (166). Assume each node in the 
network measures data that satisfy the conditions described in Sec. 6.1. Consider two modes of operation. 
In one mode, each node in the network runs the adaptive diffusion algorithm (201)-(203). In the second 
mode, each node operates individually and runs the non-cooperative LMS algorithm (207). In both cases, the 
positive step-sizes used by all nodes are assumed to be the same, say, fJ-k = fJ- for all k, and the value of \i is 
chosen to satisfy the required mean stability conditions (250) and (252), which are met by selecting 



P- < min 1 "\ TT, — r \ (254) 

l<k<N I X max (Ru,k) J 

It then holds that the magnitude of the mean error vector, \\E,Wi\\ in the diffusion case decays to zero more 
rapidly than in the non-cooperative case. In other words, diffusion enhances convergence rate. 

□ 



Theorem 6.4. (Mean Convergence Rate is Enhanced: Uniform Covariance Data) Consider the 
same setting of Theorem 6.3. Assume the covariance data are uniform across all nodes, say, R u ,k = Ru is 
independent of k. Assume further that the nodes in both modes of operation employ steps-sizes \ik that are 
chosen to satisfy the required stability conditions (250) and (252), which in this case are met by: 

Wfc < . 2 m y k = l,2,...,N (255) 

It then holds that the magnitude of the mean error vector, ||EC&t||, in the diffusion case also decays to zero 
more rapidly than in the non-cooperative case. In other words, diffusion enhances convergence rate. 

□ 

The next statement considers the case of ATC and CTA strategies (204)-(206) without information exchange, 
which correspond to the choice C = In- The result establishes that these strategies always enhance the 
convergence rate over the non-cooperative case, without the need to assume uniform step-sizes or uniform 
covariance data. 

Theorem 6.5. (Mean Convergence Rate is Enhanced when C = I) Consider the problem of opti- 
mizing the global cost (92) with the individual cost functions given by (93). Pick left stochastic matrices A\ 
and A-i satisfying (166) and set C = In- This situation covers the ATC and CTA strategies (204)-(206) 
that do not involve information exchange. Assume each node in the network measures data that satisfy the 
conditions described in Sec. 6.1. Consider two modes of operation. In one mode, each node in the network 
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runs the adaptive diffusion algorithm (204)-(206). In the second mode, each node operates individually and 
runs the non-cooperative LMS algorithm (207). In both cases, the positive step-sizes are chosen to satisfy 
the required stability conditions (251) and (252), which in this case are met by 

Mfc < t k = l,2,...,N (256) 

It then holds that the magnitude of the mean error vector, ||EiOj||, in the diffusion case decays to zero more 
rapidly than in the non-cooperative case. In other words, diffusion cooperation enhances convergence rate. 

□ 

The results of the previous theorems again highlight the following important facts about the role of the 
combination matrices {A\, A2, C} in the convergence behavior of the adaptive diffusion strategy (201)-(203): 

(a) The matrix C influences the mean stability of the network through its influence on the bound in (250). 
This is because the matrices {Rk} depend on the entries of C. The matrices {Ai, A2} do not influence 
network mean stability. 

(b) The matrices {A\, A2,C} influence the rate of convergence of the mean weight-error vector over the 
network since they influence the spectral radius of the matrix A\ (Jnm — M.TZ) Af, which controls the 
dynamics of the weight error vector in (248). 



6.5 Mean-Square Stability 

It is not sufficient to ensure the stability of the weight-error vector in the mean sense. The error vectors, 
Wk,i, may be converging on average to zero but they may have large fluctuations around the zero value. 
We therefore need to examine how small the error vectors get. To do so, we perform a mean-square-error 
analysis. The purpose of the analysis is to evaluate how the variances E UtOfc^H evolve with time and what 
their steady-state values are, for each node k. 

In this section, we are particularly interested in evaluating the evolution of two mcan-square-errors, 
namely, 

E||€» M || a and E|e a , fe (i)| 2 (257) 

The steady-state values of these quantities determine the MSD and EMSE performance levels at node k 
and, therefore, convey critical information about the performance of the network. Under the independence 
assumption on the regression data from Sec. 6.1, it can be verified that the EMSE variance can be written 
as: 

E|e a , fc (i)| 2 = Elufc.iffifc.i-xl 3 

= E [E(tOfc ii _ x ttfc )i ttfc )i tOfc > i_i|i&fc > i)] 
= Effi^i \Eu% ti u k ,i] Wk,i-i 

= E||™ M -illL fc (258) 

in terms of a weighted square measure with weighting matrix R u ^. Here we are using the notation ||x|||; to 
denote the weighted square quantity x*T,x, for any column vector x and matrix E. Thus, we can evaluate 
mean-square-errors of the form (257) by evaluating the means of weighted square quantities of the following 
form: 

E||t5 M ||| fc (259) 

for an arbitrary Hermitian nonnegative-definite weighting matrix E& that we are free to choose. By setting 
Efe to different values (say, Efc = I or E^ = R u ,k), we can extract various types of information about 
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the nodes and the network, as the discussion will reveal. The approach we follow is based on the energy 
conservation framework of [4, 5, 57]. 

So, let S denote an arbitrary N x N block Hcrmitian nonnegative-definite matrix that we are free to 
choose, with M x M block entries {S^}. Let a denote the (NM) 2 x 1 vector that is obtained by stacking 
the columns of E on top of each other, written as 



vec(E) 



(260) 



In the sequel, it will become more convenient to work with the vector representation a than with the matrix 
£ itself. 

We start from the weight-error vector recursion (246) and re-write it more compactly as: 



w, = BiWi-i - Qs i} i > 



where the coefficient matrices B, and Q are short-hand representations for 



B, 



A (Inm - MTLi) Al 



and 



G = AlMC T 



Note that Bi is stochastic and time-variant, while Q is constant. We denote the mean of Bi by 



B = EBi = Al{I NM -MK)Al 



where 1Z is defined by (181). Now equating weighted square measures on both sides of (261) we get 



7 -tils 



Expanding the right-hand side we find that 

||tt>i||| = w^BiEBiWi-i + s*g T Y>gsi- 
w^B^Gsi - s*e T £B^_i 

Under expectation, the last two terms on the right-hand side evaluate to zero so that 



(261) 

(262) 
(263) 

(264) 
(265) 

(266) 
(267) 



E||5y| = E (w*_ 1 B*T,BiWi-ij + E (s*Q T T,Qsi) 
Let us evaluate each of the expectations on the right-hand side. The last expectation is given by 

E(s*S T £S Si ) = Tr (g T HgEs t s*) 

{2 = ] Tr(g T i:gs) 

= Tr (Y,gSg T ) (268) 

where S is defined by (241) and where we used the fact that Ti(AB) = Tr(BA) for any two matrices A and 
B of compatible dimensions. With regards to the first expectation on the right-hand side of (267), we have 

E (tSJ.iSJEBitCi-i) = E [E (w^B^BiWi-^Wi-r)] 
= Ew*_ 1 [E (B*T,Bi)] Wi-i 



Etu^E'iWj-i 
ElltOi-iH!, 



(269) 
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where we introduced the nonncgativc-dcfinitc weighting matrix 



E' ± 



( = 2) EAi(I NM 



KiM) AzY<Al (I NM - MTL%) Aj 



= AxA^AlAl - AiA 2 ?.Ai,M1lAi - A 1 1ZMA 2 ZA 1 2 A{ + 0(M 



T A T 



(270) 



where 1Z is defined by (181) and the term 0(A4 2 ) denotes the following factor, which depends on the square 



of the step-sizes, {fi 2 .}: 



0(M 2 



E {Ain t MA 2 T,AlM'R. l Al) 



(271) 



The evaluation of the above expectation depends on higher-order moments of the regression data. While we 
can continue with the analysis by taking this factor into account, as was done in [4,5, 18,57], it is sufficient 
for the exposition in this article to focus on the case of sufficiently small step-sizes where terms involving 
higher powers of the step-sizes can be ignored. Therefore, we continue our discussion by letting 



E' = AiAiEA^Ai - AiA-zVAlMTlAi - AiRMAtEJ&AJ 



(272) 



The weighting matrix E' is fully defined in terms of the step-size matrix, the network topology through 
the matrices {Ai,A 2 ,C}, and the regression statistical profile through 1Z. Expression (272) tells us how to 
construct E' from E. The expression can be transformed into a more compact and revealing form if we 
instead relate the vector forms a' = vec(E') and a = vec(E). Using the following equalities for arbitrary 
matrices {U, W, E} of compatible dimensions [5]: 



vec(C/EW0 



{W T ®U)a 



Tr(EW0 = [vcc(W T )] T a 
and applying the vec operation to both sides of (272) we get 

a' = [AiA 2 ® AiA 2 ) <j - (Ai1l T MA 2 <g> AiA 2 ) a - (A1A2 ® A{RMA 2 ) a 

That is, 



(273) 
(274) 



a = J-a 



where we are introducing the coefficient matrix of size (NAI) 2 x (NMy 



T = {AiA 2 ® AiA 2 ) - {A 1 K T MA 2 ® AiA 2 ) - (AiA 2 ® A 1 11MA 2 ) 



A reasonable approximate expression for T for sufficiently small step-sizes is 



T « B <8> B* 



(275) 



(276) 



(277) 



Indeed, if we replace B from (264) into (277) and expand terms, we obtain the same factors that appear in 
(276) plus an additional term that depends on the square of the step-sizes, {/^}, whose effect can be ignored 
for sufficiently small step-sizes. 

In this way, using in addition property (274), we find that relation (267) becomes: 



|||, + [vec(gS T G T )] a 



(278) 



54 



The last term is dependent on the network topology through the matrix Q, which is defined in terms of 
{A2,C, A4}, and the noise and regression data statistical profile through S. It is convenient to introduce the 
alternative notation ||xj| 2 to refer to the weighted square quantity ||x||§;, where a = vec(E). We shall use 
these two notations interchangeably. The convenience of the vector notation is that it allows us to exploit 
the simpler linear relation (275) between a' and a to rewrite (278) as shown in (279) below, with the same 
weight vector a appearing on both sides. 

Theorem 6.6. (Variance Relation,) Consider the data model of Sec. 6.1 and the independence statistical 
conditions imposed on the noise and regression data, including (208)-(215). Assume further sufficiently 
small step-sizes are used so that terms that depend on higher-powers of the step-sizes can be ignored. Pick 
left stochastic matrices A\ and A2 and a right stochastic matrix C satisfying (166). Under these conditions, 
the weight-error vector Wi = co\{wk.i}^ =1 associated with a network running the adaptive diffusion strategy 
(201)-(203) satisfies the following variance relation 



E\\wi 



= EWibi 



vcc 



(y T )] T ° 



(279) 



for any Hermitian nonnegative- definite matrix S with a — vec(E), and where {S, G,J~} are defined by (241), 
(263), and (277), and 



y = gsg T (280) 

□ 

Note that relation (279) is not an actual recursion; this is because the weighting matrices {a, Fa} on both 
sides of the equality arc different. The relation can be transformed into a true recursion by expanding it 
into a convenient state-space model; this argument was pursued in [4,5,18,57] and is not necessary for the 
exposition here, except to say that stability of the matrix J- ensures the mean-square stability of the filter 
— this fact is also established further ahead through relation (327). By mean-square stability we mean that 
each term E HtH^H 2 remains bounded over time and converges to a steady-state MSD& value. Moreover, the 
spectral radius of T controls the rate of convergence of E ||w,|| 2 towards its steady-state value. 



Theorem 6.7. (Mean- Square Stability,) Consider the same setting of Theorem 6.6. The adaptive dif- 
fusion strategy (201)-(203) is mean-square stable if, and only if, the matrix T defined by (276), or its 
approximation (277), is stable (i.e., all its eigenvalues lie strictly inside the unit disc). This condition is 
satisfied by sufficiently small positive step-sizes {pk} that are necessarily smaller than the following bound: 

A max (i?fc) 

where the neighborhood covariance matrix Rk is defined by (182). Moreover, the convergence rate of the 
algorithm is determined by the value \p(B)] 2 (the square of the spectral radius of B). 

Proof. Recall that, for two arbitrary matrices A and B of compatible dimensions, the eigenvalues of the Kronecker 
product A® B is formed of all product combinations \i(A)Xj(B) of the eigenvalues of A and B [19]. Therefore, using 
expression (277), we have that p(J-) = [p(B)] 2 . It follows that T is stable if, and only if, B is stable. We already noted 
earlier in Theorem 6.1 that condition (281) ensures the stability of B. Therefore, step-sizes that ensure stability in 
the mean and are sufficiently small will also ensure mean-square stability. 

□ 

Remark. More generally, had we not ignored the second-order term (271), the expression for T would have 
been the following. Starting from the definition E' = E8*EBj, we would get 

E Bj ® B* 
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so that 

T = e(bJ®B?) (for general step-sizes) (282) 
= {A 1 ® A x ) • {/ - (TZ T M ® /) - (J ® TIM) + E (nfM ® 7?- 4 x) } • (-4 a g> A 2 ) 

Mean-square stability of the filter would then require the step-sizes {fik} to be chosen such that they ensure 
the stability of this matrix T (in addition to condition (281) to ensure mean stability). 

□ 



6.6 Network Mean-Square Performance 

We can now use the variance relation (279) to evaluate the network performance, as well as the performance 
of the individual nodes, in steady-state. Since the dynamics is mean-square stable for sufficiently small 
step-sizes, we take the limit of (279) as i — >• oo and write: 



limElltfliC = lim EH^.xll^ + [vec(^ T )] a 



(283) 



Grouping terms leads to the following result. 

Corollary 6.1. (Steady-State Variance Relation,) Consider the same setting of Theorem 6.6. The 
weight-error vector, Wi = co\{wk,i}k=it of the adaptive diffusion strategy (201)-(203) satisfies the following 
relation in steady- state: 



lim 



; i|l(/-JF)<T 



[vec(y T )] T a 



(284) 



for any Hermitian nonnegative-definite matrix E with a = vec(E), and where {F,y} are defined by (277) 
and (280). 

□ 

Expression (284) is a very useful relation; it allows us to evaluate the network MSD and EMSE through 
proper selection of the weighting vector a (or, cquivalently, the weighting matrix E). For example, the 
network MSD is defined as the average value: 



MSD 



1 N 

network A E 



w k . 



(285) 



fc=i 



which amounts to averaging the MSDs of the individual nodes. Therefore, 



MSD 



network 



lim -j=E||Si| 



lim E /jv 



(286) 



This means that in order to recover the network MSD from relation (284), we should select the weighting 
vector a such that 

(I - F)o- = —vcc(I NM ) 

Solving for a and substituting back into (284) we arrive at the following expression for the network MSD: 



MSD nctwork = j_ . [ yec (yT^ T (/ _ ^ _ 1 ^ (/ ^ } 



(287) 
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Likewise, the network EMSE is denned as the average value 



EMSE 1 



1 N 

i— >oo 1\ * — ' 



fe=l 
A 



= Um — V E 

i->oo N ^— ' 



\ w k,i\\Ru 



k=l 



which amounts to averaging the EMSEs of the individual nodes. Therefore, 



EMSE 



network 



i 



i 



lim — E ||tUj||j;„„r D D D i = lim — l&\\w 



where 7?. u is the matrix defined earlier by (184), and which we repeat below for ease of reference: 

1Z U = diag{i?. Ui i, i? Uj 2, • ■ • , Ru,n} 



(288) 



(289) 



(290) 



This means that in order to recover the network EMSE from relation (284), we should select the weighting 
vector a such that 

{l-T)o = ivec(ft„) (291) 
Solving for a and substituting into (284) we arrive at the following expression for the network EMSE: 



EMSE „etwork = 1 . ^ yec (yT^T (/ _ _ ^ ^ 



(292) 



6.7 Mean-Square Performance of Individual Nodes 

We can also assess the mean-square performance of the individual nodes in the network from (284). For 
instance, the MSD of any particular node k is defined by 



A 



MSD fc = lim E||«jfci| 



(293) 



Introduce the N x N block diagonal matrix with blocks of size M x M, where all blocks on the diagonal are 
zero except for an identity matrix on the diagonal block of index k, i.e., 



Jk = diag{ M , . . . , M , I M , M , . . . , Ojvf } 
Then, we can express the node MSD as follows: 



A 



MSD fc = lim IE ll-ii},: 1,2 



Jk 



The same argument that was used to obtain the network MSD then leads to 



MSD* 



[vec (y T )] -(I-T)- 1 -vec(J k ) 



(294) 



(295) 



(296) 



Likewise, the EMSE of node k is defined by 



EMSEfe 



lim E|e 0l fc(i)| 



lim E\\w ktt \\ 2 R 

i— >oo ' 



(297) 
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Introduce the N x N block diagonal matrix with blocks of size M x M, where all blocks on the diagonal are 
zero except for the diagonal block of index k whose value is R Ui k, he., 



Tk = diag{ M , . . . , A /, R u ,k, Omj • ■ • 7 Oa/ } 
Then, we can express the node EMSE as follows: 

EMSE fe 

The same argument that was used to obtain the network EMSE then leads to 



lim E\\wi\$ k 



EMSEt 



[vec(y T )Y .(J-J-)-i. V ec(r fe ) 



(298) 
(299) 

(300) 



We summarize the results in the following statement. 

Theorem 6.8. (Network Mean-Square Performance ) Consider the same setting of Theorem 6.6. In- 
troduce the 1 x (N M) 2 row vector h T defined by 



' T A 



[vec(y T )] T -(I-^ 



(301) 



where {J 7 , y} are defined by (277) and (280). Then the network MSD and EMSE and the individual node 
performance measures are given by 



MSD notwork 
EM g E network 

MSDfc 

EMSE fe 

where {jTk,Tk} are defined by (294) an d (298). 



h T ■vec(I NM )/N 
h T -vec (Ku)/N 
h T ■ vec ( J k ) 
h T -vec(T k ) 



(302) 
(303) 
(304) 
(305) 

□ 

We can obviously recover from the above expressions the performance of the nodes in the non-cooperative 
implementation (207), where each node performs its adaptation individually, by setting Ai = A2 — C = In- 
We can express the network MSD, and its EMSE if desired, in an alternative useful form involving a 
series representation. 

Corollary 6.2. (Series Representation for Network MSDJ Consider the same setting of Theorem 6.6. 
The network MSD can be expressed in the following alternative series expansion form: 



MSD 



network 



00 

= Tr(WT 



where 

y = QSQ T 

g = AlMC T 

B = Aj{I - MTZ)Aj 
Proof. Since J- is stable when the filter is mean-square stable, we can expand (7 — J-)^ 1 as 

(/- rr 1 = 1 + t + t 2 + ... 

( = 7) 1 + (b t ®b*) + (b t ®b> 



{306) 



(307) 
(308) 
(309) 



Substituting into (287) and using property (274), we obtain the desired result. 



□ 
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6.8 Uniform Data Profile 



We can simplify expressions (307)-(309) for {y, Q,B} in the case when the regression covariance matrices 
are uniform across the network and all nodes employ the same step-size, i.e., when 



Ru,k = Ru, for all k (uniform covariance profile) 
fik — (i, for all k (uniform step-sizes) 

and when the combination matrix C is doubly stochastic, so that 



CT = 1, C J 1 = 1 



(310) 
(311) 



(312) 



We refer to conditions (310)-(312) as corresponding to a uniform data profile environment. The noise 
variances, {a 2 fe }, do not need to be uniform so that the signal-to- noise ratio (SNR) across the network can 
still vary from node to node. The simplified expressions derived in the sequel will be useful in Sec. 7 when 
we compare the performance of various cooperation strategies. 

Thus, under conditions (310)-(312), expressions (180), (181), and (263) for {A4 7 1Z, Q} simplify to 



M 
K 
Q 



1^1 NM 

In <& Ru 



(313) 
(314) 
(315) 



Substituting these values into expression (309) for B we get 



B = A%(I-MK)Aj 

= {Al ® 7) • (7 - n{I ® Ru)) ■ (Aj ® 7) 

= {Al®I)(A\ 

= {A T 2 Al®I) 

= A\A\ (g> (I - ijlR u ) 



I) - n(Al ® ® R u )(Aj ® I) 
Il(A t 2 A\ ® Ru) 



where we used the useful Kronecker product identities: 

(X + Y) <g> Z 
(X ®Y)(W <g> Z) 



(X®Z) + {Y®Z) 
{XW®YZ) 



(316) 



(317) 
(318) 



for any matrices {X, Y, Z, W} of compatible dimensions. Likewise, introduce the N x N diagonal matrix 
with noise variances: 



R v = diag{<^ i, o% 2 , a 2 vN } 



(319) 



Then, expression (241) for S becomes 

S = diag{al A R u , al 2 R u , . . . , al N R u } 
= R v ® R u 

It then follows that we can simplify expression (307) for y as: 

y = fi 2 A%C T SCA 2 

= fi 2 ■ {Al ® I) ■ (C T (8> 7) ® (Rv ® Ru) ■ (C ® 7) • (A 2 <g> 7) 
= fi 2 (A 2 r C T R v CA 2 ®Ru) 



(320) 



(321) 
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Corollary 6.3. (Network MSD for Uniform Data Profile^ Consider the same setting of Theorem 6.6 
with the additional requirement that conditions (310)— (312) for a uniform data profile hold. The network 
MSD is still given by the same series representation (306) where now 



y = ^ 2 {AlC T R v CA 2 ®R u ) 
B = AlAl ®(I - fiRu) 



(322) 
(323) 



Using these expressions, we can decouple the network MSD expression (306) into two separate factors: one 
is dependent on the step-size and data covariance {fx, R u }, and the other is dependent on the combination 
matrices and noise profile {Ai, A 2 , C, R v }: 



SD nctwork = M_ Tr 



N 



3=0 



(AjAiy ( 



Ai C 1 R V CA 2 ) (A^Y 



[(I - /iR u y R U (I - uR u ) 



(324) 



Proof. Using (306) and the given expressions (322)-(323) for {y,B}, we get 



MSD 



network 



J=0 



{Ai C 1 R V CA 2 ® R 



u ) [(AiA 2 ) j ® (I-aRuY^j 



Result (324) follows from property (317). 



□ 



6.9 Transient Mean-Square Performance 

Before comparing the mean-square performance of various cooperation strategies, we pause to comment that 
the variance relation (279) can also be used to characterize the transient behavior of the network, and not 
just its steady-state performance. To see this, iterating (279) starting from i — 0, we find that 



Wi 



+ [™c(y T )] T - E-^ 



(325) 



where 



A c 
= W 



(326) 

in terms of the initial condition, W—\. If this initial condition happens to be = 0, then W-i = w°. 
Comparing expression (325) at time instants i and i — 1 we can relate E Hi&i ||^ and E l^i-i \\& as follows: 



E\\wi 



vec 



(y T )\ 



(327) 



This recursion relates the same weighted square measures of the error vectors {u)i, Wi^i}. It therefore 
describes how these weighted square measures evolve over time. It is clear from this relation that, for mean- 
square stability, the matrix T needs to be stable so that the terms involving T 1 do not grow unbounded. 

The learning curve of the network is the curve that describes the evolution of the network EMSE over 
time. At any time i, the network EMSE is denoted by ((i) and measured as: 



A 



1 N 

^E E i e ^(*)i 



(328) 



fc=i 

N 



W k,i\\R u , k 



fe=l 
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The above expression indicates that is obtained by averaging the EMSE of the individual nodes at time 
i. Therefore, 



1 



C(») = ^\\™ 



where 1Z U is the matrix defined by (290). This means that in order to evaluate the evolution of the network 
EMSE from relation (327), we simply select the weighting vector a such that 



1 1 1 "Ru 



(329) 



a = —vcc(TZ u ) 

Substituting into (327) we arrive at the learning curve for the network. 



(330) 



Corollary 6.4. (Network Learning Curve ) Consider the same setting of Theorem 6.6. Let £(z) denote 
the network EMSE at time i, as defined by (328). Then, the learning curve of the network corresponds to 
the evolution of Q(i) with time and is described by the following recursion over i > 0: 



(331) 




where {T,y,K u } are defined by (277), (280), and (290). 



□ 



7 Comparing the Performance of Cooperative Strategies 

Using the expressions just derived for the MSD of the network, we can compare the performance of various 
cooperative and non-cooperative strategies. Table 6 further ahead summarizes the results derived in this 
section and the conditions under which they hold. 



7.1 Comparing ATC and CTA Strategies 

We first compare the performance of the adaptive ATC and CTA diffusion strategies (153) and (154) when 
they employ a doubly stochastic combination matrix A. That is, let us consider the two scenarios: 

C, A 1 = A, A 2 = I N (adaptive CTA strategy) (332) 

C, Ai = I N , A 2 = A (adaptive ATC strategy) (333) 

where A is now assumed to be doubly stochastic, i.e., 

Al = t, A T t = 1 (334) 

with its rows and columns adding up to one. For example, these conditions are satisfied when A is left 
stochastic and symmetric. Then, expressions (307) and (309) give: 

S c ta = (I-MK)A T , y cta = MC T SCM (335) 
Sate = A T (I-MK), y atc = A T MC T SCMA (336) 

where 

A = A ® I M (337) 
Following [18], introduce the auxiliary nonncgativc-definite matrix 

Hj = [{I - MK)A T ] j ■ MC T SCM ■ [{I - MK)A T ] * J (338) 
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Then, it is immediate to verify from (306) that 



MSD™ rk 



MSD „etwork 



oo 

3=0 
oo 

-J2^(A T Hi A) (340) 



N 

3=0 



so that 

oo 

MSD „etwork _ M g D „ctwork = L^Tt (%, - A* HjA) (341) 

3=0 

Now, since A is doubly stochastic, it also holds that the enlarged matrix A is doubly stochastic. Moreover, 
for any doubly stochastic matrix A and any nonnegativc-definitc matrix % of compatible dimensions, it 
holds that (see part (f) of Theorem C.3): 

Tr{A T HA) < Tr(-H) (342) 
Applying result (342) to (341) we conclude that 



MSD™ twork < MSD™* work 



(doubly stochastic A) (343) 



so that the adaptive ATC strategy (153) outperforms the adaptive CTA strategy (154) for doubly stochastic 
combination matrices A. 

7.2 Comparing Strategies with and without Information Exchange 

We now examine the effect of information exchange (C ^ I) on the performance of the adaptive ATC and 
CTA diffusion strategies (153)-(154) under conditions (310)-(312) for uniform data profile. 

CTA Strategies 

We start with the adaptive CTA strategy (154), and consider two scenarios with and without information 
exchange. These scenarios correspond to the following selections in the general description (201)-(203): 

C I, Ai = A, A2 = In (adaptive CTA with information exchange ) (344) 

C = I, Ai = A, A2 = In (adaptive CTA without information exchange) (345) 

Then, expressions (322) and (323) give: 

iW^i = A T ®{I-nR u ), y cta . c ^i = ^{C T R V C®R U ) (346) 
B ct a,c=i = A T ®{I-^R U ) 1 y cta , c =i = ii 2 {R v ®Ru) (347) 

where the matrix R v is defined by (319). Note that Bcta.c^i — £>cta,c=i, so we denote them simply by B in 
the derivation that follows. Then, from expression (306) for the network MSD we get: 

2 °° 

MSD~*-MSD~£ = >LYl Tr ( Bj [( R »- CTR v C )® R "} B * j ) ( 348 ) 

3=0 

It follows that the difference in performance between both CTA implementations depends on how the matrices 
R v and C T R V C compare to each other: 
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(1) When R v - C T R V C > 0, we obtain 



MSD c n ™ > MSD~ k / 



(when C T R V C < R v ) 



(349) 



so that a CTA implementation with information exchange performs better than a CTA implementation 
without information exchange. Note that the condition on {R V ,C} corresponds to requiring 

C T R V C < R v (350) 

which can be interpreted to mean that the cooperation matrix C should be such that it does not 
amplify the effect of measurement noise. For example, this situation occurs when the noise profile is 
uniform across the network, in which case R v = <j 2 Im- This is because it would then hold that 

R v - C T R V C = o*(I - C T C) > (351) 

in view of the fact that (/ — C T C) > since C is doubly stochastic (cf. property (e) in Lemma C.3). 

(2) When R v - C T R V C < 0, we obtain 



network 



(when C T R V C > R v ) 



(352) 



so that a CTA implementation without information exchange performs better than a CTA implementa- 
tion with information exchange. In this case, the condition on {R v , C} indicates that the combination 
matrix C ends up amplifying the effect of noise. 



ATC Strategies 

We can repeat the argument for the adaptive ATC strategy (153), and consider two scenarios with and with- 
out information exchange. These scenarios correspond to the following selections in the general description 
(201)-(203): 

C I, A\ = In, A2 = A (adaptive ATC with information exchange ) (353) 

C = I, Ai = In, A2 — A (adaptive ATC without information exchange) (354) 

Then, expressions (322) and (323) give: 

B«tcc,a = A T ®(I-/iR u ), JVcvi = fi 2 (A T C T R V C A ® R u ) (355) 
S atc ,c=i = A T ® (/ - vRu), 3>atc,c=i = ^ 2 {A T R V A®R U ) (356) 

Note again that 2? a tc,c/i = £> a t c ,c=ij so we denote them simply by £>. Then, 

2 00 

MSD"^c= k j - MSD™*^ = j- J2 Tr [ AT (Rv -C T R V C)A® R u ] B*>) (357) 

j=o 

It again follows that the difference in performance between both ATC implementations depends on how the 
matrices R v and C T R V C compare to each other and we obtain: 



MSD 



network 
atc,C=J 



> MSD 



network 



(when C T R V C < R v ) 



(358) 



and 



MSD~ k 7 



< 



MSD~ k , 



(when C T R V C > R v ) 



(359) 
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Tabic 6: Comparison of the MSD performance of various cooperative strategies. 



Comparison 


Conditions 


MSD , a \ c c twork < MSD^ work 


A doubly stochastic, C right stochastic. 


MSD££S& < MSD^ w c°=/ 


C T R V C < R v , C doubly stochastic, R u ,k = Ru, fJ-k = 


MSD^ W C °* < MSD»tcw 


C T R V C > R v , C doubly stochastic, R u ,k = Ru, Hk = 


MSD- tw c°*/ < MSD- tw c°=/ 


C T R V C < R v , C doubly stochastic, R u> k = Ru, Hk = 


MSD- tw c°=/ < MSD- tw c°*/ 


C T R V C > Rv, C doubly stochastic, R Ui k = Ru, Hk = fJ- 


MSD'^ c twork < MSD^ work < MSDi^ work 


{A, C} doubly stochastic, R u ,k = Ru, fJ-k = pi. 



7.3 Comparing Diffusion Strategies with the Non-Cooperative Strategy 

We now compare the performance of the adaptive CTA strategy (154) to the non-cooperative LMS strategy 
(207) assuming conditions (310)-(312) for uniform data profile. These scenarios correspond to the following 
selections in the general description (201)-(203): 

C, Ai = A, A 2 =I (adaptive CTA) (360) 
C = I, Ai = I, A 2 = I (non-cooperative LMS) (361) 

where A is further assumed to be doubly stochastic (along with C) so that 

At = 1, A T t = 1 (362) 

Then, expressions (322) and (323) give: 

Seta = A T ®(I~fiR u ), y ct& = ^ 2 (C T R V C®R U ) (363) 
B ims = I®(I-iiRu), Vims = n 2 {R v ®Ru) (364) 

Now recall that 

C = C® I M (365) 
so that, using the Kronecker product property (317), 

J^cta = ii 2 (C T R v C®R u ) 

= \l 2 {C t ® I M ){Rv <8> Ru)(C ® I M ) 

= fJ 2 C T (R v (E)Ru)C 

= C T y lms C (366) 

Then, 



"'eta J 



J=0 j=0 

1 OO 1 OO 

J=0 j=0 

1 OO 

= aF E Tr [(« s - CB*J a Bi ta C T ) y ims ] (367) 



3=0 



G4 



Let us examine the difference: 

°lms°lms uo cta°cta u 



= (I ® (I - fiRu) 23 ) - ® (I - uRaY) (A? T C T ® (/ - 

( = 7) (i^il-iiRufi) - (CAiAi T C T ®(I- t iR u fi) 
= (I- CA^A> T C T ) ® (J - Mi?u) 2j 



(368) 

Now, due to the even power, it always holds that (J — ^R u ) 2j > 0. Moreover, since A? and C are doubly 
stochastic, it follows that CA>Ai T C T is also doubly stochastic. Therefore, the matrix (J — CA^A^ T C T ) is 
nonnegative-deffnite as well (cf. property (c) of Lemma C.3). It follows that 



lms 1ms 



CBILB^C 1 > 



But since D^ims > 0, we conclude from (367) that 



MSD 



network 
lms 



> MSD 



network 



(369) 



(370) 



This is because for any two Hcrmitian nonnegative-definite matrices A and B of compatible dimensions, it 
holds that Tr(AB) > 0; indeed if we factor B = XX* with X full rank, then Tr(AB) = Ti{X* AX) > 0. We 
conclude from this analysis that adaptive CTA diffusion performs better than non-cooperative LMS under 
uniform data profile conditions and doubly stochastic A. If we refer to the earlier result (343), we conclude 
that the following relation holds: 



MSD^ t c c twork < MSD™ rk < MSDj^ s work 



(371) 



Table 6 lists the comparison results derived in this section and lists the conditions under which the conclusions 
hold. 



8 Selecting the Combination Weights 

The adaptive diffusion strategy (201)-(203) employs combination weights {ai t ek,d2,ik,C£k} or, equivalently, 
combination matrices {A\, A2,C}, where A\ and A2 are left-stochastic matrices and C is a right-stochastic 
matrix. There arc several ways by which these matrices can be selected. In this section, we describe 
constructions that result in left-stochastic or doubly-stochastic combination matrices, A. When a right- 
stochastic combination matrix is needed, such as C, then it can be obtained by transposition of the left- 
stochastic constructions shown below. 

8.1 Constant Combination Weights 

Table 7 lists a couple of common choices for selecting constant combination weights for a network with N 
nodes. Several of these constructions appeared originally in the literature on graph theory. In the table, the 
symbol nk denotes the degree of node k, which refers to the size of its neighborhood. Likewise, the symbol 
"max refers to the maximum degree across the network, i.e., 

"max = max {n k } (372) 

l<fc<7V 

The Laplacian rule, which appears in the second line of the table, relies on the use of the Laplacian matrix 
C of the network and a positive scalar 7. The Laplacian matrix is defined by (574) in App. B, namely, it is 
a symmetric matrix whose entries are constructed as follows [64-66]: 

r n fc -i, i£k = e 

[C] k i = \ — 1, if k ^ i and nodes k and i are neighbors (373) 
0, otherwise 
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The Laplacian rule can be reduced to other forms through the selection of the positive parameter 7. One 
choice is 7 = l/n ma xj while another choice is 7 = 1/N and leads to the maximum-degree rule. Obviously, 
it always holds that n max < N so that l/n max > 1/N. Therefore, the choice 7 = l/n max ends up assigning 
larger weights to neighbors than the choice 7 = 1/N. The averaging rule in the first row of the table is one 
of the simplest combination rules whereby nodes simply average data from their neighbors. 



Table 7: Selections for combination matrices A = {ait. 



Entries of Combination Matrix A 


Type of A 


1. Averaging rule [68]: 

J 1 /rik , if k 7^ I are neighbors or k — £ 
aik \ 0, otherwise 


left-stochastic 


2. Laplacian rule [49,69]: 
A = I N - yC, 7 > 


symmetric and 
doubly-stochastic 


3. Lap 

aik = | 


lacian rule using 7 = 1/ n max : 

1/^max, if k 7^ £ are neighbors 
1 — (rife — l)/n m ax, k = £ 
0, otherwise 


symmetric and 
doubly-stochastic 


4. Lap 

aik = | 


lacian rule using 7 = l/A r (maximum-degree rule [50]) : 

1/N, if k 7^ £ axe neighbors 

1 - (n h - 1)/N, k = £ 

0, otherwise 


symmetric and 
doubly-stochastic 


5. Mel 

aik = 


;ropolis rule [49,70,71]: 

1/ max{nk, ni }, if k 7^ £ are neighbors 

, 1- aek ' k = £ 

ieu k \{k} 
0, otherwise 


symmetric and 
doubly-stochastic 


6. Rel 

a tk = < 


ative-degree rule [72]: 

' ( \ 

m/ 1 n m , if fc and £ are neighbors or k = £ 

\meM k J 
0, otherwise 

>■ 


left-stochastic 



In the constructions in Table 7, the values of the weights {aik} are largely dependent on the degree of the 
nodes. In this way, the number of connections that each node has influences the combination weights with 
its neighbors. While such selections may be appropriate in some applications, they can nevertheless degrade 
the performance of adaptation over networks [54] . This is because such weighting schemes ignore the noise 
profile across the network. And since some nodes can be noisier than others, it is not sufficient to rely solely 
on the amount of connectivity that nodes have to determine the combination weights to their neighbors. It is 
important to take into account the amount of noise that is present at the nodes as well. Therefore, designing 
combination rules that are aware of the variation in noise profile across the network is an important task. 
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It is also important to devise strategies that are able to adapt these combination weights in response to 
variations in network topology and data statistical profile. For this reason, following [58], we describe in the 
next subsection one adaptive procedure for adjusting the combination weights. This procedure allows the 
network to assign more or less relevance to nodes according to the quality of their data. 

8.2 Optimizing the Combination Weights 

Ideally, we would like to select N x N combination matrices {Ai, A 2 ,C} in order to minimize the network 
MSD given by (302) or (306). In [18], the selection of the combination weights was formulated as the following 
optimization problem: 



min MSD 

{A lt A 2 ,C} 



network 



given by (302) or (306) 



over left and right-stochastic matrices with nonnegative entries: 
Ajl = t, a Mfc = 0if^A4 
Afl = l, a 2Jk = Q if I £ N k 
CI = 1, c ik = iil^Mk 



(374) 



We can pursue a numerical solution to (374) in order to search for optimal combination matrices, as was 
done in [18]. Here, however, we are interested in an adaptive solution that becomes part of the learning 
process so that the network can adapt the weights on the fly in response to network conditions. Wc illustrate 
one approximate approach from [58] that leads to one adaptive solution that performs reasonably well in 
practice. 

We illustrate the construction by considering the ATC strategy (158) without information exchange where 
A\ = In, A2 — A, and C = I. In this case, recursions (204)-(206) take the form: 



ij) kti = + l-i k u* kl [d k (i) - Uk,iW k ,i-\} 

and, from (306), the corresponding network MSD performance is: 



(375) 
(376) 



SD- tWOTk = Tr(i3i tc y atc & 

3=0 



where 



(377) 



^atc 
J4tc 

n u 
s 



A T (I~MK U ) 
A T MSMA 

diag{i4,i, Ru,2, • ■ • , Ru,n] 

diag{er,y iRu,x, <rl 2 Ru,2, ■ ■ ■ , &v,nRu,n} 



M = diagj^iiM, M2^a/, ■ • ■ , UN hi) 
A = A®I M 



(378) 
(379) 
(380) 
(381) 
(382) 
(383) 



Minimizing the MSD expression (377) over left-stochastic matrices A is generally non-trivial. We pursue an 
approximate solution. 

To begin with, for compactness of notation, let r denote the spectral radius of the N x N block matrix 
/ - MK U : 

A 



P (I - MK U ) 



(384) 
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We already know, in view of the mean and mean-square stability of the network, that \r\ < 1. Now, consider 
the series that appears in (377) and whose trace we wish to minimize over A. Note that its block maximum 
norm can be bounded as follows: 



E S atc34tc# 



< 



(") 

< 



(6) 
< 



oo 

E 


R? 
"ate 






H 


f oo 

E 




\j=o 


H 


f OO 

X 






M 


' OO 







lla 



ate lift , 



B* 3 

"ate 



6,OC 



- ate 



||3^atc||i, )00 
<itc 1 1 h oo ■ llXtcll 



2j 



,2j 



a ^c || b,oc 



!_ r 2 

where for step (b) we use result (602) to conclude that 

Sate \\b,oo = 



ate Hi, oo 



A 1 (I - MKJlWoo 

^Ikoo'll 



< P T |koo-||/-X^ ll ||,,,oo 



(602) 



(385) 



To justify step (a), we use result (584) to relate the norms of B 3 atc and its complex conjugate, 



"ate 



(386) 
, as 
(387) 



Expression (385) then shows that the norm of the series appearing in (377) is bounded by a scaled multiple 
of the norm of D4,tc, and the scaling constant is independent of A. Using property (586) we conclude that 
there exists a positive constant c, also independent of A, such that 



Tr[EsL3W?ai] < c-Tr(34te) 



(388) 



Therefore, instead of attempting to minimize the trace of the series, the above result motivates us to minimize 
an upper bound to the trace. Thus, we consider the alternative problem of minimizing the first term of the 
series (377), namely, 

min TrQ4 tc ) 

(389) 

subject to A T 1 = 1, a lk > 0, a lk = if I g Af k 
Using (379), the trace of J4tc can be expressed in terms of the combination coefficients as follows: 



N N 



Tr(Xtc) = E E $ a « ^{Ru.t) 



(390) 



fe=i i=i 
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so that problem (389) can be decoupled into N separate optimization problems of the form: 



N 



{atk} 



mm 



51 A a \k 0"V Tr(R u/ ), k = l,...,N 



1=1 



(391) 




With each node £, we associate the following nonnegative noise-data-dependent measure: 



(392) 



This measure amounts to scaling the noise variance at node £ by fig and by the power of the regression 
data (measured through the trace of its covariance matrix) . We shall refer to 7I as the noise-data variance 
product (or variance product, for simplicity) at node £. Then, the solution of (391) is given by: 



We refer to this combination rule as the relative-variance combination rule [58]; it leads to a left-stochastic 
matrix A. In this construction, node k combines the intermediate estimates {ipg 4 } from its neighbors in (376) 
in proportion to the inverses of their variance products, {7„ 2 }- The result is physically meaningful. Nodes 
with smaller variance products will generally be given larger weights. In comparison, the following relative- 
degree-variance rule was proposed in [18] (a typo appears in Table III in [18], where the noise variances 
appear written in the table instead of their inverses): 



This second form also leads to a left-stochastic combination matrix A. However, rule (394) does not take 
into account the covariance matrices of the regression data across the network. Observe that in the special 
case when both the regression covariance matrices and the noise variances are uniform across the network, 
i.e., R U: k = R u and k = a% for all k, expression (393) reduces to the simple averaging rule (first line of 
Table 7). In contrast, expression (394) reduces the relative degree rule (last line of Table 7). 

8.3 Adaptive Combination Weights 

To evaluate the combination weights (393), the nodes need to know the variance products, {7™}, of their 
neighbors. According to (392), the factors {7^} are defined in terms of the noise variances, {c^ m }, and the 
regression covariance matrices, {Tr(i? Uim )}, and these quantities are not known beforehand. The nodes only 
have access to realizations of {d rn (i), u m> i}. We now describe one procedure that allows every node k to 
learn the variance products of its neighbors in an adaptive manner. Note that if a particular node £ happens 
to belong to two neighborhoods, say, the neighborhood of node k\ and the neighborhood of node k 2 , then 




(relative- variance rule) 



(393) 




(relative degree- variance rule) 



(394) 
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each of k\ and &2 need to evaluate the variance product, 7?, of node I. The procedure described below allows 
each node in the network to estimate the variance products of its neighbors in a recursive manner. 

To motivate the algorithm, we refer to the ATC recursion (375)-(376) and use the data model (208) to 
write for node t: 

ij) lti = we,i-i + (iei4,i [uz,iWe,i-i + v t (i)\ (395) 

so that, in view of our earlier assumptions on the regression data and noise in Sec. 6.1, we obtain in the limit 

as i — > 00: 

|2 .2 / i- w ||~ ||2 



1— >oo \ 1— > 00 



hm E ll^i-tu/.i-ijl = ni- \ \™n^-i\\ z E{u , MeA2uiA ) + Hi ■ at, f Tr(i?„, f ) (396) 



We can evaluate the limit on the right-hand side by using the steady-state result (284). Indeed, we select 
the vector a in (284) to satisfy 

(I-T)a = vecfE^llt^illVO] (397) 

Then, from (284), 



!™ E II s '-* = [ vcc (^)] ■(/-^- 1 -vec[E(ti; ii ||« <li ||Vi)] (398) 



Now recall from expression (379) for y that for the ATC algorithm under consideration we have 

y = A T MSMA (399) 

so that the entries of y depend on combinations of the squared step-sizes, {fJ^, m= 1,2,..., N}. This fact 
implies that the first term on the right-hand side of (396) depends on products of the form {/j^fj,^}; these 
fourth-order factors can be ignored in comparison to the second-order factor /j, 2 for small step-sizes so that 

lim E \\ip ti - Wi i-i\\ « fit • a le ' Ti(Ru,e) 

i— too 11 ' 

= It (400) 

in terms of the desired variance product, 7^. Using the following instantaneous approximation at node k 
(where u^,j-i is replaced by Wk^-i)- 

EH^-tB^-xll 2 « U^-w^W 2 (401) 

we can motivate an algorithm that enables node k to estimate the variance product, 7J, of its neighbor t. 
Thus, let 7|^.(i) denote an estimate for 7^ that is computed by node k at time i. Then, one way to evaluate 
7«cM i s through the recursion: 



iJkW = (1 - *>*) ■ llS ~ !) + v k ■ ~ w k y 



(402) 



where is a positive coefficient smaller than one. Note that under expectation, expression (402) becomes 

E7tt(<) = U - «*) ' E 7«(* - 1) + vh -E 11^, - ™M-if (403) 
so that in steady-state, as i — > 00, 

lim E~f 2 k (i) « (1 - Hfe) • Hm E 7 , 2 fc (i - 1) + ^ ■ 7? (404) 

Hence, we obtain 



limE 7 | fc (i)«7| (405) 

2— >-00 
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That is, the estimator ^tjAi) converges on average close to the desired variance product 7^. In this way, we 
can replace the optimal weights (393) by the adaptive construction: 



i££eAf k 



aik(i) 



0, 



otherwise 



(406) 



Equations (402) and (406) provide one adaptive construction for the combination weights {aik}- 

9 Diffusion with Noisy Information Exchanges 

The adaptive diffusion strategy (201)-(203) relies on the fusion of local information collected from neighbor- 
hoods through the use of combination matrices {A\, A2,C}. In the previous section, we described several 
constructions for selecting such combination matrices. We also motivated and developed an adaptive scheme 
for the ATC mode of operation (375)-(376) that computes combination weights in a manner that is aware 
of the variation of the variance-product profile across the network. Nevertheless, in addition to the mea- 
surement noises {vk{i)} at the individual nodes, we also need to consider the effect of perturbations that 
are introduced during the exchange of information among neighboring nodes. Noise over the communication 
links can be due to various factors including thermal noise and imperfect channel information. Studying 
the degradation in mean-square performance that results from these noisy exchanges can be pursued by 
straightforward extension of the mean-square analysis of Sec. 6, as we proceed to illustrate. Subsequently, 
we shall use the results to show how the combination weights can also be adapted in the presence of noisy 
exchange links. 

9.1 Noise Sources over Exchange Links 

To model noisy links, we introduce an additive noise component into each of the steps of the diffusion strategy 
(201)-(203) during the operations of information exchange among the nodes. The notation becomes a bit 
cumbersome because we need to account for both the source and destination of the information that is being 
exchanged. For example, the same signal dg(i) that is generated by node I will be broadcast to all the 
neighbors of node I. When this is done, a different noise will interfere with the exchange of dg(i) over each 
of the edges that link node £ to its neighbors. Thus, we will need to use a notation of the form dffc(i), with 
two subscripts I and k, to indicate that this is the noisy version of dg(i) that is received by node k from node 
t. The subscript Ik indicates that £ is the source and k is the sink, i.e., information is moving from £ to k. 
For the reverse situation where information flows from node k to £, we would use instead the subscript kl. 
With this notation in mind, we model the noisy data received by node k from its neighbor £ as follows: 



where '. j (M x 1), vy k i [M x 1), and Vgt '. (1 x M) are vector noise signals, and v ek (i) is a scalar 
noise signal. These are the noise signals that perturb exchanges over the edge linking source £ to sink k 
(i.e., for data sent from node £ to node k). The superscripts {(w), ("0), (u), (d)} in each case refer to the 
variable that these noises perturb. Figure 14 illustrates the various noise sources that perturb the exchange 
of information from node £ to node k. The figure also shows the measurement noises {vg(i), Vk{i)} that exist 
locally at the nodes in view of the data model (208). 



Wtk,i-1 = V>t,i-1 + v £k'i-l 



Ue k ,i = Ul t i + v} k 'j 

dik{i) = di(i) + vf k \i) 



(407) 
(408) 
(409) 
(410) 
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Figure 14: Additive noise sources perturb the exchange of information from node I to node k. The subscript Ik in 
this illustration indicates that I is the source node and k is the sink node so that information is flowing from I to k. 



We assume that the following noise signals, which influence the data received by node k, 

{v k (i), „«(,•), v W f ,W} ( 4ii) 

are temporally white and spatially independent random processes with zero mean and variances or covariances 
given by 



f 2 2 



Tf( W ) rWO n 

Zki ^v^ki n v,ik' JX 



Obviously, the quantities 



/ rr 2 P( W ) pW P(") \ 

^■u^fci n v,lki *V*fc! n-v/kj 



(412) 



(413) 



are all zero if £ ^ 7\4 or when l=k. We further assume that the noise processes (411) are independent of 
each other and of the regression data u m ,j for all k, £, m and i, j. 

9.2 Error Recursion 

Using the perturbed data (407)-(410), the adaptive diffusion strategy (201)-(203) becomes 

0k,i-i = }^ auk wtk,i-i (414) 



^ ax,tk wek,i-i 
4>k,i = <t>k,i-i + Mfc X! Qfc [ d « W ~ U *k,i<i>k,i-l] 



(415) 
(416) 
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Observe that the perturbed quantities {wgk,i-ii v,£k,i, d^k{i), V'ffc ,•}, with subscripts ik, appear in (414)- 
(416) in place of the original quantities {wt,i-i : ui^, de(i), ipi j} that appear in (201)-(203). This is because 
these quantities are now subject to exchange noises. As before, we are still interested in examining the 
evolution of the weight-error vectors: 



w k>i = w°-w k>i , k = l,2,...,N (417) 
For this purpose, we again introduce the following N x 1 block vector, whose entries are of size Mxl each: 



Wi 



A 



W2,i 



(418) 



and proceed to determine a recursion for its evolution over time. The arguments arc largely similar to what 
we already did before in Sec. 6.3 and, therefore, we shall emphasize the differences that arise. The main 
deviation is that we now need to account for the presence of the new noise signals; they will contribute 
additional terms to the recursion for Wi — see (442) further ahead. We may note that some studies on 
the effect of imperfect data exchanges on the performance of adaptive diffusion algorithms are considered 
in [59-61]. However, these earlier investigations were limited to particular cases in which only noise in the 
exchange of w^i-i was considered (as in (407)), in addition to setting C = I (in which case there is no 
exchange of {d^(i), m^,,}), and by focusing on the CTA case for which A2 = I. Here, we consider instead the 
more general case that accounts for the additional sources of imperfections shown in (408)-(410), in addition 
to the general diffusion strategy (201)-(203) with combination matrices {Ai, A2,C}. 
To begin with, we introduce the aggregate Mxl zero-mean noise signals: 



(w) A (w) 



,M>) 



(419) 



These noises represent the aggregate effect on node k of all exchange noises from the neighbors of node k 
while exchanging the estimates {wgj-i, j} during the two combination steps (201) and (203). The M x M 
covariance matrices of these noises are given by 



v,k - Z^i U v,lk> 









(420) 



These expressions aggregate the exchange noise covariances in the neighborhood of node k; the covariances 



are scaled by the squared coefficients {a\ 



l 2,lk 



}. We collect these noise signals, and their covariances, 



from across the network into JVx 1 block vectors and N x N block diagonal matrices as follows: 



O) 


A 


col{ 


(u>) 

«i,i-n 


WO 

v i 


A 


col{ 


WO 
v i,i . v 




A 


diag 




R w 


A 


diag 





R 



WO 

v,2 ' 



We further introduce the following scalar zero-mean noise signal 

vtk(i) = v t (i) + v$(i) - 



(u) o 

v ik,i w 



(421) 
(422) 
(423) 
(424) 

(425) 
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whose variance is 



'Ik 



jk 



w°*R[ u lw° 



(426) 



In the absence of exchange noises for the data \dg (i), ti^i}, the signal v^{i) would coincide with the measure- 
ment noise vi (i). Expression (425) is simply a reflection of the aggregate effect of the noises in exchanging 
{di(i), ui^i} on node k. Indeed, starting from the data model (208) and using (409)-(410), we can easily 
verify that the noisy data {dik(i), uik^} are related via: 



dik{i) 



uek.iW 



(427) 



We also define (compare with (234)-(235) and note that we are now using the perturbed regression vectors 

{U£k,i}) : 



diag{H' l i; JR' 2 ?; , 



It holds that 



where 



K = E ^ 


R u j - 


1 DC") 











When there is no noise during the exchange of the regression data, i.e., when R^J k 
{R' ki , K+, R'k) reduce to expressions (234)-(235) and (182) for {R k ,i, R k }. 
Likewise, we introduce (compare with (239)): 

Zk,i = E c ekU*ek,t v £k{i) 

Zl ± col{z u ,z 2ii ,...,z Nti } 



(428) 
(429) 

(430) 
(431) 

0, the expressions for 

(432) 
(433) 



Compared with the earlier definition for s; in (239) when there is no noise over the exchange links, we see 
that we now need to account for the various noisy versions of the same regression vector uej and the same 
signal d({i). For instance, the vectors ue k ,i and ui m ,i would denote two noisy versions received by nodes k 
and m for the same regression vector ut t i transmitted from node I. Likewise, the scalars dg k (i) and di m (i) 
would denote two noisy versions received by nodes k and m for the same scalar df (i) transmitted from node 
I. As a result, the quantity Zi is not zero mean any longer (in contrast to Si, which had zero mean). Indeed, 
note that 



Ez k:i = E °ik E«tt,i"ft(«) 



E c » E 



, («) 

U i,i + V ikr 



Mi) + vfkd) - v 



ZkA 



( E ^ <l) w ° 



(434) 
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It follows that 



Ez, 



£ on R {u) 



E 



w 



(435) 



Although we can continue our analysis by studying this general case in which the vectors Zi do not have 
zero-mean (see [62,63]), we shall nevertheless limit our discussion in the sequel to the case in which there is 
no noise during the exchange of the regression data, i.e., we henceforth assume that: 



(u) 
hkA 



0, R 



(«) 



0, U£k A = UP, 



(assumption from this point onwards) 



(436) 



We maintain all other noise sources, which occur during the exchange of the weight estimates {u^..;_i, xpg ,} 
and the data {dg(i)}. Under condition (436), we obtain 







^-2 _i_ ^-2 

a v,l + "v,(.k 



eeAfk 



(182) 



Rk 



(437) 
(438) 

(439) 



Then, the covariance matrix of each term z/~a is given by 




(440) 



and the covariance matrix of Zi is N x N block diagonal with blocks of size M x M: 



Z = Ez. t z* = diag{i? z4 , R Zi2 ,..., R Z)N } 



(441) 



Now repeating the argument that led to (246) we arrive at the following recursion for the weight-error vector: 

(noisy links) (442) 



t»< = Al (Jjvm - MK[) Alwi-x - A T 2 Mz t - A\ (I NM - M7L'^ - V W 



For comparison purposes, we repeat recursion (246) here (recall that this recursion corresponds to the case 
when the exchanges over the links are not subject to noise): 

Wi = A T 2 (Jjvm - MTU)A\ Wi-i - AlMC T s l (perfect links) (443) 

Comparing (442) and (443) we find that: 

(1) The covariance matrix Hi in (443) is replaced by TZ^. Recall from (429) that contains the influence 
of the noises that arise during the exchange of the regression data, i.e., the {Ry But smce we are 
assuming m this article that R^] k = 0, then K'^K^ 
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(2) The term C 7 Sj in (443) is replaced by Zj. Recall from (432) that contains the influence of the noises 
that arise during the exchange of the measurement data and the regression data, i.e., the {a^ ek , -R^ fc }- 

(3) Two new driving terms appear involving w^— l an d v i ■ These terms reflect the influence of the noises 
during the exchange of the weight estimates {_wt^-\,%j) t 

(4) Observe further that: 

(4a) The term involving v^_{ accounts for noise introduced at the information-exchange step (414) 
before adaptation. 

(4b) The term involving z.i accounts for noise introduced during the adaptation step (415). 

(4c) The term involving accounts for noise introduced at the information-exchange step (416) 
after adaptation. 

Therefore, since we are not considering noise during the exchange of the regression data, the weight-error 
recursion (442) simplifies to: 



A {Inm - MHi) A^Wi-! - AlMzi - A\ (I NM - MTU) v^_\ - vf> 



(noisy links) (444) 



where we used the fact that 72.' = 1Z; under these conditions. 



9.3 Convergence in the Mean 



Taking expectations of both sides of (444) we find that the mean error vector evolves according to the 
following recursion: 



Ewi = {Inm -Mn)Aj -Ewi-t, i>0 



(445) 



with 1Z defined by (181). This is the same recursion encountered earlier in (248) during perfect data 
exchanges. Note that had we considered noises during the exchange of the regression data, then the vector 
Zi in (444) would not be zero mean and the matrix Ti! i will have to be used instead of 72-i. In that case, the 
recursion for Eihi will be different from (445); i.e., the presence of noise during the exchange of regression 
data alters the dynamics of the mean error vector in an important way — see [62,63] for details on how to 
extend the arguments to this general case with a driving non-zero bias term. We can now extend Theorem 6.1 
to the current scenario. 

Theorem 9.1. (Convergence in the Mean,) Consider the problem of optimizing the global cost (92) with 
the individual cost functions given by (93). Pick a right stochastic matrix C and left stochastic matrices A\ 
and Ai satisfying (166). Assume each node in the network runs the perturbed adaptive diffusion algorithm 
(4-14-)-(4-16). Assume further that the exchange of the variables {wi^—i, j, dt (i)} is subject to additive 
noises as in (407), (408), and (410). We assume that the regressors are exchanged unperturbed. Then, all 
estimators {wk,i\ across the network will still converge in the mean to the optimal solution w° if the step-size 
parameters {fJ-k} satisfy 



(446) 



where the neighborhood covariance matrix is defined by (182). That is, Etu^j — > w° as i — > oo. 

□ 
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9.4 Mean-Square Convergence 

Recall from (264) that we introduced the matrix: 

B = Al {Inm - Mil) Aj 
We further introduce the N x N block matrix with blocks of size M x M each: 

U = Al(I N M-MTl) 



(447) 



(448) 



Then, starting from (444) and repeating the argument that led to (279) we can establish the validity of the 
following variance relation: 



w 



i-l \\Ta 



VCC 



(A%MZ T MA 2 ) + vec ( (wR^ )T U 



(449) 



for an arbitrary nonnegative-definite weighting matrix X with a = vec(S), and where T is the same matrix 
defined earlier cither by (276) or (277). We can therefore extend the statement of Theorem 6.7 to the present 
scenario. 

Theorem 9.2. (Mean- Square Stability ) Consider the same setting of Theorem 9.1. Assume sufficiently 
small step-sizes to justify ignoring terms that depend on higher powers of the step-sizes. The perturbed 
adaptive diffusion algorithm (414)-(416) * s mean-square stable if, and only if the matrix J- defined by (276), 
or its approximation (277), is stable (i.e., all its eigenvalues are strictly inside the unit disc). This condition 
is satisfied by sufficiently small step-sizes {/ife} that are necessarily smaller than the following bound: 



Amax(^?fc) 



(450) 



where the neighborhood covariance matrix is defined by (182). Moreover, the convergence rate of the 
algorithm is determined by [/>(£>)] 2 . 

□ 

We conclude from the previous two theorems that the conditions for the mean and mean-square convergence 
of the adaptive diffusion strategy are not affected by the presence of noises over the exchange links (under 
the assumption that the regression data are exchanged without perturbation; otherwise, the convergence 
conditions would be affected). The mean-square performance, on the other hand, is affected as follows. 
Introduce the N x N block matrix: 



imperfect 



A iT 



AlMZMA 2 + UK\ w) U* + 



(imperfect exchanges) 



(451) 



which should be compared with the corresponding quantity defined by (280) for the perfect exchanges case, 
namely, 

^perfect = A^MC 1 SCMA 2 (perfect exchanges) (452) 

When perfect exchanges occur, the matrix Z reduces to C T SC. We can relate 3^ m pcrfcct and ^perfect as 
follows. Let 



Tl (du) = diag \ CfV'n^i- c i2°li2Ruj, ■■■ c In°1,inRuJ > 



(453) 
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Then, using (438) and (441), it is straightforward to verify that 

z = c T sc + n {du) 

and it follows that: 

imperfect - ^perfect + J% MR^ MA 2 + Wl^H* + 
= ^perfect + 



(454) 



(455) 



Expression (455) reflects the influence of the noises {Bi w) , al lk }. Substituting the definition (451) 

into (449), and taking the limit as i — > oo, we obtain from the latter expression that: 



lim E\\w 



i\\{I-F)<j 



He (^perfect) ] Tcr 



(456) 



which has the same form as (284); therefore, we can proceed analogously to obtain: 





TV/rCTVlctwork 

^ imperfect 


1 

N ' 


7 eC (^perfect)] T ' i 1 - 


7-)- 1 


■ vec (I NM ) 


and 














T?A/TQT? nctwork 
£ J lVlS£v imperfoct - 


1 

N 


■ [ v ec (^ iporfcct )] T • {I 




1 • vec (TZ U ) 



Using (455), we see that the network MSD and EMSE deteriorate as follows: 



MSD 



network 
imperfect 



MSD 



network 
perfect 



1 ■ [vec (Ay T )] T • (I - J-)" 1 ■ vec (I NM ) 

l 



(457) 

(458) 

(459) 
(460) 



EMSE!™ k ct = EMSE~ t k + ^ ■ [vec (Ay T )Y ■ (I — J") -1 ■ vec (TZ U ) 
9.5 Adaptive Combination Weights 

We can repeat the discussion from Sees. 8.2 and 8.3 to devise one adaptive scheme to adjust the combination 
coefficients in the noisy exchange case. We illustrate the construction by considering the ATC strategy 
corresponding to A\ = In, A-2 = A, C = In, so that only weight estimates are exchanged and the update 
recursions are of the form: 



Tpk,i = w k,i-l + fJ>kUl ti [dk(i) —Uk,iWk,i-l] 



where from (408): 



Ck.i 



'ik.i 



In this case, the network MSD performance (457) becomes 

= ^I>(» 



tv Tcpv network 
ivl01 -' ate, C=I, imperfect 



atc,C=I^' atl =,imperfectS a t C: C=I 



(461) 
(462) 

(463) 
(464) 
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where, since now Z = S and TZ ( V W) = 0, we have 



S a tc,C=I 
, imperfect 

V 



i? 



v . k 
S 

M 
A 



A T (I - MK U ) 
A T MSMA + 

diagji?^, R^2-> ■•■■> R^n} 

diag{i? Ui i, i? Ui 2, • ■ • , -Ru,tv} 

diag{/ii/A/, M2^/, ■ ■ ■ , fJ-Nhi} 
A® I M 



(465) 
(466) 
(467) 
(468) 

(469) 
(470) 
(471) 
(472) 



To proceed, as was the case with (389), we consider the following simplified optimization problem 

min Tr(3^atc, imperfect) 

subject to A T t = 1, a lk > 0, a tk = if £ M k 
Using (466), the trace of y^tc, imperfect can be expressed in terms of the combination coefficients as follows 



(473) 



N N 



Tr(34tc,impcrfect) = X/ X fl « (4 a l,t Tt ( R u,() + Tr \R$ k 
k=l 1=1 

so that problem (473) can be decoupled into N separate optimization problems of the form: 



(474) 



N 








min > 

{aiki e=1 e=1 


a ik tfol.i Tr(i? M ,f) - 


fTr(i?W);, 


k = l,...,N 


subject to 


N 






aik > 0, 


^ atk = 1, aik 


= if I i A4 





(475) 



With each node £, we associate the following nonnegative variance products: 



ilk = A ■ o*. ■ Tr(Ru,t) + Tr C , k S M 



(476) 



This measure now incorporates information about the exchange noise covariances {R^i k }. Then, the solution 
of (475) is given by: 







if I e M k 






aik = < 






[o, 


otherwise 



(relative- variance rule) 



(477) 



We continue to refer to this combination rule as the relative-variance combination rule [58]; it leads to a 
left-stochastic matrix A. To evaluate the combination weights (477), the nodes need to know the variance 

^mk i 



products, {7^ fe }, of their neighbors. As before, we can motivate one adaptive construction as follows. 
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We refer to the ATC recursion (461)-(463) and use the data model (208) to write for node I: 



Ik A 



W 



+ iJ, e uh[ui t iWt ti -i +Vi(i)] + 



,M>) 

'Ik,: 



(478) 



so that, in view of our earlier assumptions on the regression data and noise in Sees. 6.1 and 9.1, we obtain 
in the limit as i — > oo: 



lim 



Wtk,i ~ w e,i-i\\ = Mr lim E||«;i_i 



'E(«J ii ||u/, i || 3 tt/, i ) 



■ al e ■ Tr(R u j) + Tr (r[% ) (479) 



In a manner similar to what was done before for (396), we can evaluate the limit on the right-hand side by 
using the corresponding steady-state result (456). We select the vector a in (456) to satisfy: 



{I-T)o = vec [E (i^Ju^H 2 ^)] 



(480) 



Then, from (456), 



[vec (3>J tc , 



imperfect 



)] • (I - T)- 1 • vec [E (ukK.ill V*)] (481) 



ll"f,il| 2, U£,i) 

Now recall from expression (466) that the entries of 3^atc, imperfect depend on combinations of the squared 

step-sizes, {/ij,, m = 1,2,... , N}, and on terms involving {Tr(i?«)}. This fact implies that the first 

term on the right-hand side of (479) depends on products of the form {/if/ij^}; these fourth-order factors can 
be ignored in comparison to the second-order factor /x| for small step-sizes. Moreover, the same first term 

on the right-hand side of (479) depends on products of the form |/i^Tr (^R$C\ j, which can be ignored in 
comparison to the last term, Tr 

( R $k) > in ( 479 )' which docs not 

appear multiplied by a squared step-size. 

Therefore, we can approximate: 



lim E i/' 



ik.i 



We 



Ilk 



(482) 



in terms of the desired variance product, j^ k . Using the following instantaneous approximation at node k 
(where u^,j-i is replaced by WkA-i)'- 



IE 



tk. 



\\i>ik,i - WkA-ltf 



(483) 



we can motivate an algorithm that enables node k to estimate the variance products j^ k . Thus, let jf k (i) 
denote an estimate for 7^. that is computed by node k at time i. Then, one way to evaluate 7^ fc («) is through 
the recursion: 



llk{i) = C 1 - v k) ■ lIS - !) + v k ■ \\ipek,i - wjm-iII 5 



where is a positive coefficient smaller than one. Indeed, it can be verified that 

lim Kj 2 ek (i) « 7f fc 



(484) 



(485) 



so that the estimator 7^(2) converges on average close to the desired variance product yf k . In this way, we 
can replace the weights (477) by the adaptive construction: 









atk(i) = < 




if i G Afk 




[0, 


otherwise 



(486) 



Equations (484) and 



provide one adaptive construction for the combination weights {agk}- 
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10 Extensions and Further Considerations 



Several extensions and variations of diffusion strategies are possible. Among those variations we mention 
strategies that endow nodes with temporal processing abilities, in addition to their spatial cooperation 
abilities. We can also apply diffusion strategies to solve recursive least-squares and state-space estimation 
problems in a distributed manner. In this section, we highlight select contributions in these and related 
areas. 



10.1 Adaptive Diffusion Strategies with Smoothing Mechanisms 

In the ATC and CTA adaptive diffusion strategies (153)-(154), each node in the network shares information 
locally with its neighbors through a process of spatial cooperation or combination. In this section, we describe 
briefly an extension that adds a temporal dimension to the processing at the nodes. For example, in the 
ATC implementation (153), rather than have each node k rely solely on current data, {de(i),ue,i, £ £ A4}, 
and on current weight estimates received from the neighbors, {iptj, I £ node k can be allowed to 

store and process present and past weight estimates, say, P of them as in {tpe,j, j = — 1, . . . ,i — P + 1}. 
In this way, previous weight estimates can be smoothed and used more effectively to help enhance the 
mean-square-deviation performance especially in the presence of noise over the communication links. 

To motivate diffusion strategies with smoothing mechanisms, we continue to assume that the random 
data {dk(i),Uk,i} satisfy the modeling assumptions of Sec. 6.1. The global cost (92) continues to be the 
same but the individual cost functions (93) are now replaced by: 

p-i 

J k (w) = ^2 qkj E \d k (i - j) - Uk,%-j w\ 2 (487) 

3=0 

so that 

N /p-1 \ 

J gloh (w) = £ ffki 1 \dk{i - j) - u k!i ^ w\ 2 \ (488) 
fc=i \j=o J 

where each coefficient q k j is a non-negative scalar representing the weight that node k assigns to data from 
time instant i — j. The coefficients {qkj} are assumed to satisfy the normalization condition: 

p-i 

fto>0, = 1. k = l,2,...,N (489) 

3=0 

When the random processes dk(i) and Uk.i are jointly wide-sense stationary, as was assumed in Sec. 6.1, the 
optimal solution w° that minimizes (488) is still given by the same normal equations (40). We can extend the 
arguments of Sees. 3 and 4 to (488) and arrive at the following version of a diffusion strategy incorporating 
temporal processing (or smoothing) of the intermediate weight estimates [73,74]: 

<Pk,i = w k ,i-i + [ik cik qio u},i [dt{i)-u^iWk,i-i] (adaptation) (490) 
p-i 

ipk.i = f k j 4>k,i-j (temporal processing or smoothing) (491) 

w k ,i = agk V*£,i (spatial processing) (492) 
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where the nonnegative coefficients {cek,aek, fkj,qio} satisfy: 

forfc= 1,2,..., TV; 



iY 



Q fc > 0, c ?k = 1, c*k = if £ i M k 
fc=i 

N 

a-ik > 0, Y a lk = 1, a ek = if £ £ N k 

p-i 

hi > o, X = 1 

3=0 

< <&!„ < 1 



(493) 

(494) 

(495) 
(496) 



Since only the coefficients {qi } are needed, we alternatively denote them by the simpler notation {qg} in 
the listing in Table 8. These are simply chosen as nonnegative coefficients: 



0<q e <l, £ = 1,2,.. 



(497) 



Note that algorithm (490)- (492) involves three steps: (a) an adaptation step (A) represented by (490); (b) a 
temporal filtering or smoothing step (T) represented by (491), and a spatial cooperation step (S) represented 
by (492). These steps are illustrated in Fig. 15. We use the letters (A,T,S) to label these steps; and we use the 
sequence of letters (A,T,S) to designate the order of the steps. According to this convention, algorithm (490)- 
(492) is referred to as the ATS diffusion strategy since adaptation is followed by temporal processing, which 
is followed by spatial processing. In total, we can obtain six different combinations of diffusion algorithms 
by changing the order by which the temporal and spatial combination steps are performed in relation to the 
adaptation step. The resulting variations are summarized in Table 8. When we use only the most recent 
weight vector in the temporal filtering step (i.e., set ipk.i = 4>k.i), which corresponds to the case P = 1, the 
algorithms of Table 8 reduce to the ATC and CTA diffusion algorithms (153) and (154). Specifically, the 
variants TSA, STA, and SAT (where spatial processing S precedes adaptation A) reduce to CTA, while the 
variants TAS, ATS, and AST (where adaptation A precedes spatial processing S) reduce to ATC. 



Adaptation (A) 



{di(i),u 1A } 



{d e (i),u eA } 



{ctk,<u}- 
{d e (i),u tli }- 



<t>k,% 



rv: 



Temporal Processing (T) Spatial Processing (S) 




{4>k,i-j} 



<t>k,i-P+l 









* 


adaptation 




{/«}-> 


filtering 












We,k} ■ 



aggregation 



Wk,; 



Figure 15: Illustration of the three steps involved in an ATS diffusion strategy: adaptation, followed by temporal 
processing or smoothing, followed by spatial processing. 



The mean-square performance analysis of the smoothed diffusion strategies can be pursued by extending 
the arguments of Sec. 6. This step is carried out in [73, 74] for doubly stochastic combination matrices A 
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when the filtering coefficients {fkj} do not change with k. For instance, it is shown in [74] that whether 
temporal processing is performed before or after adaptation, the strategy that performs adaptation before 
spatial cooperation is always better. Specifically, the six diffusion variants can be divided into two groups 
with the respective network MSDs satisfying the following relations: 

Group #1 : MSD^* = MSD^° rk > MSD^ ork (498) 

Group #2 : MSDg^ ork > MSD^ ork = MSD^ ork (499) 

Note that within groups 1 and 2, the order of the A and T operations is the same: in group 1, T precedes A 
and in group 2, A precedes T. Moreover, within each group, the order of the A and S operations determines 
performance; the strategy that performs A before S is better. Note further that when P = 1, so that 
temporal processing is not performed, then TAS reduces to ATC and TSA reduces to CTA. This conclusion 
is consistent with our earlier result (343) that ATC outperforms CTA. 



Table 8: Six diffusion strategies with temporal smoothing steps. 



TSA diffusion: 

P-l 

4*k,i-l = ^ fkj Wk,i-j-l 

j=o 

ipk,i-i = ^2 atk 4>t,i-i 
leJV h 

w k ,i = "0/M-i + Mfc X! qe ° ek u *,i [^(*) ~ Ut ' i ^M-l] 


TAS diffusion: 

P-l 

<f>k,i-i = X] fa W k,i-j-l 

3=0 

1pk,i = 4>k,i-l + Mfe X! qi Clk U ti [^d) ~ u t,i<t>k,i-l] 

Wk,i = 22 a Wipi,i 


STA diffusion: 

<f>k,i-i = a-tk Wi t i-i 

P-l 

1pk,i-l = /J fkj (j>k,i-j-l 
3=0 

W k ,i = 4>k,i-\ + Mfc qi Cik U Xi ~~ U^ilpkA-l] 

teM k 


ATS diffusion: 

(f>k,i = V>k,i-l + *^2 qe Ctk U ti [^(*) ~~ u l,i w k,i-l\ 

p-i 

i>k,i = *^2 fkj4>k,i-3 

3=0 

WkS = ^2 a ^i,i 


SAT diffusion: 

<j>k,i-l = X! a ?kWt,i-l 

1pk,i = <f>k,i-l + Mfc X! qe Ctk U ti [did) ~ u l,i<t>k,i-l] 
P-l 

Wk,i = fkj4>k,i-j 
3=0 


AST diffusion: 

4>k,i = Wk,i-i + Vk^2 qe cik u* ei [di(i) - ut ti Wk,i-i] 

ipk,i = *^2 a «^>« 

p-i 

V>k,i = X! fkj1pk,i-j 
3=0 



In related work, reference [75] started from the CTA algorithm (159) without information exchange and 
added a useful projection step to it between the combination step and the adaptation step; i.e., the work 
considered an algorithm with an STA structure (with spatial combination occurring first, followed by a 
projection step, and then adaptation). The projection step uses the current weight estimate, 4>k,i, at node 
k and projects it onto hyperslabs defined by the current and past raw data. Specifically, the algorithm 
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from [75] has the following form: 




p-i 



J=0 



(500) 
(501) 
(502) 



where the notation ?/> = T-'fc,?: [0] refers to the act of projecting the vector cj) onto the hyperslab P k j that 
consists of all M x 1 vectors z satisfying (similarly for the projection V ki ): 



P, 



k . i 



k,i 



A 
A 



{ z such that |d fc (i) - u k jz\ < e k } 
{ z such that \d k (i) — u k ^z\ < } 



(503) 
(504) 



where {e k ,e' k } are positive (tolerance) parameters chosen by the designer to satisfy e' k > e k . For generic 
values {d, u, e}, where <i is a scalar and u is a row vector, the projection operator is described analytically 
by the following expression [76] : 



u<p] 



+ < 



if d — e > uo 



if \d — ucf>\ < e 



if d + e < uq 



(505) 



The projections that appear in (501)-(502) can be regarded as another example of a temporal processing 
step. Observe, however, from the middle plot in Fig. 15 that the temporal step that we are considering in 
the algorithms listed in Table 8 is based on each node k using its current and past weight estimates, such as 
{4>k,i, 4>k,i—ii ■ ■ ■ j 0fc,i--P+i}j rather than only and current and past raw data {d k (i), d k (i — 1), . . . , d k (i — 
P + 1), Uki, Ufc i-i, ■ • ■ , u ki _p + i}. For this reason, the temporal processing steps in Table 8 tend to exploit 
information from across the network more effectively and the resulting mean-square error performance is 
generally improved relative to (500)-(502). 

10.2 Diffusion Recursive Least-Squares 

Diffusion strategies can also be applied to recursive least-squares problems to enable distributed solutions of 
least-squares designs [28,29]. Thus, consider again a set of N nodes that are spatially distributed over some 
domain. The objective of the network is to collectively estimate some unknown column vector of length 
M, denoted by w°, using a least-squares criterion. At every time instant i, each node k collects a scalar 
measurement, d k {i), which is assumed to be related to the unknown vector w° via the linear model: 



dk{i) 



Uk,iW 



Vk{i) 



(506) 



In the above relation, the vector Uk,i denotes a row regression vector of length M, and Vk(i) denotes measure- 
ment noise. A snapshot of the data in the network at time i can be captured by collecting the measurements 
and noise samples, {d k (i), v k (i)}, from across all nodes into column vectors yi and Vi of sizes N x 1 each, 
and the regressors {u k ,i} into a matrix Hi of size N x M: 



di(i) 
d N (i) 



(N x 1), Vl 



vx(i) 
v N (i) 



(N x 1), Hi 



U2,i 
UN,; 



(N x M) (507) 
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Likewise, the history of the data across the network up to time i can be collected into vector quantities as 
follows: 



(508) 



Then, one way to estimate w° is by formulating a global least-squares optimization problem of the form: 





Vi 




Vi 




Hi 




Vi-i 




Vi-1 






y t = 




, Vi = 




) Hi — 






yo 








Ho 



min \\w\\l + \\y q - HHIw 



(509) 



where Hi > represents a Hermitian regularization matrix and Wi > represents a Hermitian weighting 
matrix. Common choices for II, and W,- are 



W 4 = diag{/Ar, XI N , \ l I N } 



(510) 
(511) 



where <5 > is usually a large number and < A < 1 is a forgetting factor whose value is generally very 
close to one. In this case, the global cost function (509) can be written in the equivalent form: 




(512) 



which is an exponentially weighted least-squares problem. We see that, for every time instant j, the squared 
errors, \dk(j) — u^ jw] 2 , are summed across the network and scaled by the exponential weighting factor X l ~ : > . 
The index i denotes current time and the index j denotes a time instant in the past. In this way, data 
occurring in the remote past are scaled more heavily than data occurring closer to present time. The global 
solution of (509) is given by [5]: 

Wi = [n l + n l w l H l r 1 H*w l y i (513) 

and the notation Wi, with a subscript i, is meant to indicate that the estimate Wi is based on all data 
collected from across the network up to time i. Therefore, the Wi that is computed via (513) amounts to a 
global construction. 

In [28, 29] a diffusion strategy was developed that allows nodes to approach the global solution Wi by 
relying solely on local interactions. Let w^.i denote a local estimate for w° that is computed by node k at 
time i. The diffusion recursive-least-squares (RLS) algorithm takes the following form. For every node k, 
we start with the initial conditions Wfe^i = and Pk,-i = SIm, where Pk,-i is an M x M matrix. Then, 
for every time instant i, each node k performs an incremental step followed by a diffusion step as follows: 
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Diffusion RLS. 



Step 1 (incremental update) 
ipk,i <~ WkA-i 
Pk,i <- A _1 Pfe,i_i 

for every neighboring node ^ S A4 , update : 

Ci k Pk,iUli 

Vk,i <~ i'k.i + ~Z ! n 5T - Ut,ii)k,i\ 

1 + c ek u eii P kii u e i 

(514) 

CikPk,lU* ti Ut^Pk,i 

*k,i 4 ^fe.j — ; „ i 

1 + ClkUi :i P k} iU ei 

end 



Step 2 (diffusion update) 

tOk.i = E aekipe,i 



where the symbol <— denotes a sequential assignment, and where the scalars {aek,cgk} are nonnegative 
coefficients satisfying: 

for jfe = 1, 2, . . . , N : 

N 

ctk > 0, °tk = 1, cik = if £ i M k (515) 
fe=i 

JV 

a tt >0, 5^o <fc = l, a tt = if * g JV fc (516) 

The above algorithm requires that at every instant i, nodes communicate to their neighbors their mea- 
surements {de(i) , U£j} for the incremental update, and the intermediate estimates {tp£,i} for the diffusion 
update. During the incremental update, node k cycles through its neighbors and incorporates their data 
contributions represented by {di(i),U£ t i} into {ipk,i, Pk,i}- Every other node in the network is performing 
similar steps. At the end of the incremental step, neighboring nodes share their intermediate estimates 
{i ) l,i\ to undergo diffusion. Thus, at the end of both steps, each node k would have updated the quantities 
{wk,i-i, Pk.i-i} to {wk,i, Pk,i}- The quantities P k ,i are matrices of size M x M each. Observe that the 
diffusion RLS implementation (514) does not require the nodes to share their matrices {Pa}, which would 
amount to a substantial burden in terms of communications resources since each of these matrices has M 2 
entries. Only the quantities {di(i), it^,-, are shared. The mean-square performance and convergence of 
the diffusion RLS strategy are studied in some detail in [29]. 

The incremental step of the diffusion RLS strategy (514) corresponds to performing a number of |A4| 
successive least-squares updates starting from the initial conditions {wk^-i, Pk.i-i} and ending with the 
values {ipk.i, Pk.i} that move on to the diffusion update step. It can be verified from the properties of 
recursive least-squares solutions [4, 5] that these variables satisfy the following equations at the end of the 
incremental stage (step 1): 

PkJ = ^k,li + E <**«w ( 5l7 ) 

p k,i'<l ) k,i = ^Pk,l-i w k,i-i + E c £kU*i,ide(i) (518) 
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Introduce the auxiliary Mxl variable: 



9k,i = P*li>k,i (519) 



Then, the above expressions lead to the following alternative form of the diffusion RLS strategy (514). 

Alternative form of diffusion RLS. 



Wk,i-1 = a ^l,i-l 

ipk.i = Pk,iqk,i 



Under some approximations, and for the special choices A = C and A = 1, the diffusion RLS strategy 
(520) can be reduced to a form given in [79] and which is described by the following equations: 



p-i 

r kA 



Qk,; 
i>k,; 



J2 cek 



P 



E 



cik [q 

PkMk.i 



Li-1 



A-X 



(521) 
(522) 
(523) 



Algorithm (521)-(523) is motivated in [79] by using consensus-type arguments. Observe that the algorithm 
requires the nodes to share the variables {dg{i), u^i, qe,i-i, Pia-x), which corresponds to more communica- 
tions overburden than required by diffusion RLS; the latter only requires that nodes share {dg(i), ug t i, 
In order to illustrate how a special case of diffusion RLS (520) can be related to this scheme, let us set 



A = C and A = 1 

Then, equations (520) give: 

Special form of diffusion RLS when A — C and A = 1. 



(524) 



WkA-X = 2J c tkil>t 



i-X 



P 



-l 

k . / 



(525) 



ipk,i = Pk,iqk,i 
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Comparing these equations with (521)-(523), we find that algorithm (521)-(523) of [79] would relate to the 
diffusion RLS algorithm (520) when the following approximations arc justified: 



eeAf k 



P 



kA-l 



(526) 



p k.l-i Yl c ^i,i-i 

PkLl w k,i-l 



(527) 
(528) 



It was indicated in [29] that the diffusion RLS implementation (514) or (520) leads to enhanced performance 
in comparison to the consensus-based update (521)— (523). 



10.3 Diffusion Kalman Filtering 

Diffusion strategies can also be applied to the solution of distributed state-space filtering and smoothing 
problems [30,31,33]. Here, we describe briefly the diffusion version of the Kalman filter; other variants and 
smoothing filters can be found in [33] . We assume that some system of interest is evolving according to linear 
state-space dynamics, and that every node in the network collects measurements that are linearly related to 
the unobserved state vector. The objective is for every node to track the state of the system over time based 
solely on local observations and on neighborhood interactions. 

Thus, consider a network consisting of N nodes observing the state vector, Xi, of size n x 1 of a linear 
state-space model. At every time i, every node k collects a measurement vector y k i of size pxl, which is 
related to the state vector as follows: 



x i+ i = FiXi + drii 
y k .i = //->-''- • '•/...• k = l,2,...,N 



(529) 
(530) 



m 




rij 


A 


' Qi 






. V k,3 _ 




R k)i 



The signals m and Vk.i denote state and measurement noises of sizes nxl and pxl, respectively, and they 
are assumed to be zero-mean, uncorrelated and white, with covariance matrices denoted by 



(531) 



The initial state vector, x Q , is assumed to be zero- mean with covariance matrix 

Ex a xl = n o > 

and is uncorrelated with rii and Vk,i, for all i and k. We further assume that Rk,i > 0. 
matrices {Fi, Gi, H^^, Qi, Rk,i, IT} are assumed to be known by node fc. 

Let Xh^tj denote a local estimator for Xi that is computed by node k at time i based solely on local 
observations and on neighborhood data up to time j. The following diffusion strategy was proposed in 
[30,31,33] to evaluate predicted and filtered versions of these local estimators in a distributed manner for 
data satisfying model (529)-(532). For every node k, we start with Xk,o\-i = an d -Pfe.o|-i = n o , where 
Pk,o\-i is an M x M matrix. At every time instant i, every node k performs an incremental step followed 



(532) 

The parameter 
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by a diffusion step: 

Time and measurement-form of the diffusion Kalman filter. 



Step 1 (incremental update) 

^k,i <~ &k,i\i-l 
Pk,i Pk,i\i-1 

for every neighboring node I € Mk , update : 

R e <— Re,i + He.,iPk,iHi i 

Pk.i <~ ffc.i — Pk,iH£ iRe 1 Hl,iPk.i frqq^ 

end (5 ^ j 



Step 2 (diffusion update) 

Pk 7 i\i — Pk,i 

x k,i+l\i — Pi x k,i\i 

Pk,i+l\i = FiPkA\iF* + GiQiG*. 



where the symbol denotes a sequential assignment, and where the scalars {a^} are nonnegative coefficients 
satisfying: 

for k = 1,2,...,N : 

N 

a-ik > 0, a '* = 1, a ik = if t £ N k (534) 

i=\ 

The above algorithm requires that at every instant i, nodes communicate to their neighbors their mea- 
surement matrices Hi t i, the noise covariance matrices Ri.i, and the measurements y ti for the incremental 
update, and the intermediate estimators i/^ i for the diffusion update. During the incremental update, node 
k cycles through its neighbors and incorporates their data contributions represented by {y e -ff^i, Re,i} into 
{xp k i , Pk,i}- Every other node in the network is performing similar steps. At the end of the incremental step, 
neighboring nodes share their updated intermediate estimators {i/^ j} to undergo diffusion. Thus, at the 
end of both steps, each node k would have updated the quantities {xk t i\i— i, Pk,i\i— 1} to {x^j+iu, Pk,i+i\i\- 
The quantities Pk,%\i-i are n x n matrices. It is important to note that even though the notation P k ,i\i 
and Pk,i\i-i has been retained for these variables, as in the standard Kalman filtering notation [5,77], these 
matrices do not represent any longer the covariances of the state estimation errors, Xk,i\i-i = x i ~~ ^fc.;|i-i? 
but can be related to them [33]. 

An alternative representation of the diffusion Kalman filter may be obtained in information form by 
further assuming that Pk,m—i > for all k and i; a sufficient condition for this fact to hold is to requires 
the matrices {Fi} to be invertible [77]. Thus, consider again data satisfying model (529)-(532). For every 
node k, we start with Xk,o\-i — and -P^oi— l = n^ 1 - At every time instant i, every node k performs an 
incremental step followed by a diffusion step: 
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Information form of the diffusion Kalman filter. 



Step 1 (incremental update) 
Sk,i = ^ H ti R ll H £,i 

^M k 

i>k,i = &k,i\i-l + Pk,i\i [lk,i ~ Sk,iX k ,i\i-l] 



Step 2: (diffusion update) 

x k,i+l\i Fi x k,i\i 

Pk,i+l\i = FiPk,i\iF* + GiQiG* 



The incremental update in (535) is similar to the update used in the distributed Kalman filter derived 
in [48]. An important difference in the algorithms is in the diffusion step. Reference [48] starts from a 
continuous-time consensus implementation and discretizes it to arrive at the following update relation: 

Xk,i\i = ^k,i + e ~ ^<m) ( 536 ) 

which, in order to facilitate comparison with (535), can be equivalently rewritten as: 

x kAi = (l + e-n k e)-ip k)i + ^ e-^ (537) 

where n k denotes the degree of node k (i.e., the size of its neighborhood, Af k ). In comparison, the diffusion 
step in (535) can be written as: 

x k ,i\i = flfcfc • ip k ,i + z2 aih- (538) 
eeAT k \{k} 

Observe that the weights used in (537) are (1 + e — n k e) for the node's estimator, ijj k i: and e for all other 
estimators, {ipa}, arriving from the neighbors of node k. In contrast, the diffusion step (538) employs a 
convex combination of the estimators {tftgi} with generally different weights {ae k } for different neighbors; 
this choice is motivated by the desire to employ combination coefficients that enhance the fusion of infor- 
mation at node k, as suggested by the discussion in App. D of [33]. It was verified in [33] that the diffusion 
implementation (538) leads to enhanced performance in comparison to the consensus-based update (537). 
Moreover, the weights {ai k } in (538) can also be adjusted over time in order to further enhance performance, 
as discussed in [78] . The mean-square performance and convergence of the diffusion Kalman filtering imple- 
mentations are studied in some detail in [33], along with other diffusion strategics for smoothing problems 
including fixed-point and fixed-lag smoothing. 
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10.4 Diffusion Distributed Optimization 

The ATC and CTA steepest-descent diffusion strategies (134) and (142) derived earlier in Sec. 3 provide 
distributed mechanisms for the solution of global optimization problems of the form: 



N 



(539) 



(540) 



k=l 

where the individual costs, Jk{w), were assumed to be quadratic in w, namely, 

J k (w) = a 2 d k - w*r dUik - r duk w + w*R Uik w 

for given parameters {a\ k , r du ,k, Ru,k}- Nevertheless, we remarked in that section that similar diffusion 
strategies can be applied to more general cases involving individual cost functions, J k (w), that are not 
necessarily quadratic in w [1,2]. We restate below, for ease of reference, the general ATC and CTA diffusion 
strategies (139) and (146) that can be used for the distributed solution of global optimization problems of 
the form (539) for more general convex functions J k (w): 



(ATC strategy) 



1pk,i = Wk,i- 








wks = 2_j 


aik ipt,i 







and 



(CTA strategy) 



ipk,i-i = y] 


d£k Wi t i-1 






Wk,i = Vfc,*— 1 


- fik 2J c *k [^wJe(ipk,i-i)}* 







(541) 



(542) 



for positive step-sizes {nk} and for nonncgativc coefficients {c£ k ,ai k } that satisfy: 

forfc = l,2,...,iV: 

N 

cik > 0, c «* = 1, c tk = if £ $ Af k 



k=l 
N 



aek 



>0, ^a ft = l, a ek = 0ife^M k 



e=i 



(543) 



That is, the matrix A = [a£k] is left-stochastic while the matrix C = [cf k ] is right-stochastic: 

Ct = l, A T t = 1 (544) 

We can again regard the above ATC and CTA strategies as special cases of the following general diffusion 
scheme: 



ipk,i 



Wk, 



<t>k,%-\ - Mfc c £k [^ W Jl((f>k,i-l)T 



(545) 
(546) 
(547) 
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where the coefficients {ai^fc, 0,2, tk, c-ik} are nonnegative coefficients corresponding to the (Z, fc)-txi entries of 
combination matrices {A\ 1 A2,C} that satisfy: 

Ajt = 1, Alt = 1, Ct = t (548) 

The convergence behavior of these diffusion strategies can be examined under both conditions of noiseless 
updates (when the gradient vectors are available) and noisy updates (when the gradient vectors are subject 
to gradient noise). The following properties can be proven for the diffusion strategies (545)-(547) [1,2]. The 
statements that follow assume, for convenience of presentation, that all data are real-valued; the conditions 
would need to be adjusted for complex-valued data. 



N 

J eloh H = ^JfcM (549) 

k=l 

denote the global cost function that we wish to minimize. Assume J gloh (w) is strictly convex so that its 
minimizer 10° is unique. Assume further that each individual cost function Jk(w) is convex and has a mini- 
mizer at the same w°. This case is common in practice; situations abound where nodes in a network need to 
work cooperatively to attain a common objective (such as tracking a target, locating the source of chemical 
leak, estimating a physical model, or identifying a statistical distribution). The case where the { Jk(w)} have 
different individual minimizers is studied in [3] , where it is shown that the same diffusion strategies of this 
section are still applicable and nodes would converge instead to a Pareto-optimal solution. 



Noiseless Updates 

Let 



Theorem 10.1. (Convergence to Optimal Solution: Noise- Free Case) Consider the problem of 
minimizing the strictly convex global cost (549), with the individual cost functions {Jk(w)} assumed to be 
convex with each having a minimizer at the same w° . Assume that all data are real-valued and suppose the 
Hessian matrices of the individual costs are bounded from below and from above as follows: 

Xi, min hi < VlMw) < A £ , max / M , l = l,2,...,N (550) 

for some positive constants {A^ m i n , Af jmax }. Let 

Cfc,min = ^2 QfcA^min, Cfc imax = ^ C£fcAf jmax (551) 

Assume further that o~k,min > and that the positive step-sizes are chosen such that: 

l^k<—^—, k = l,...,N (552) 

^fe,max 

Then, it holds that Wk,i — > w° as i — » oo. That is, the weight estimates generated by (545)-(547) at all nodes 
will tend towards the desired global minimizer. 

□ 



We note that in works on distributed subgradient methods (e.g., [40,80]), the norms of the subgradients are 
usually required to be uniformly bounded. Such a requirement is restrictive in the unconstrained optimiza- 
tion of differentiablc functions. Condition (550) is more relaxed since it allows the gradient vector V w Jg(w) 
to have unbounded norm. This extension is important because requiring bounded gradient norms, as opposed 
to bounded Hessian matrices, would exclude the possibility of using quadratic costs for the Ji(w) (since the 
gradient vectors would then be unbounded). And, as we saw in the body of the chapter, quadratic costs 
play a critical role in adaptation and learning over networks. 
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Updates with Gradient Noise 

It is often the case that we do not have access to the exact gradient vectors to use in (546), but to noisy 
versions of them, say, 

V w Jt{(j>k,i-i) = V w Je(<f>k,i-i) + vi(fa,i-i) (553) 

where the random vector variable vt(-) refers to gradient noise; its values is generally dependent on the 
weight-error vector realization, 

4>k,i-i = w° - (f>k,i-i (554) 

at which the gradient vector is being evaluated. In the presence of such gradient noises, the weight estimates 
at the various nodes become random quantities and we denote them by the boldface notation {wk,i}- We 
assume that, conditioned on the past history of these weight estimators at all nodes, namely, 

Fi-i = {w m j, m = 1,2, ...,N, j < i} (555) 
the gradient noise has zero mean and its variance is upper bounded as follows: 

E [vSk,i-i) I ^i-i} = (556) 

E {\\vSk,i-i)\\ 2 \ Fi-i} < a||?fc,i-i || 2 + o-l (557) 

for some a > and ct 2 , > 0. Condition (557) allows the variance of the gradient noise to be time-variant, 
so long as it does not grow faster than ||<^fc,i_i || 2 . This condition on the noise is more general than the 
"uniform-bounded assumption" that appears in [40], which required instead: 



3 



{\\v e (4>k,i-i)\\ 2 } < °l E {\\vi((f>k,i-i)\\ 2 |^-i} <ol (558) 



These two requirements are special cases of (557) for a = 0. Furthermore, condition (557) is similar to 
condition (4.3) in [81], which requires the noise variance to satisfy: 



E 



{\\vSk,i-i)\\ 2 I •Fi-i} < a [HV^O^-OII 2 + 1] (559) 



This requirement can be verified to be a combination of the "relative random noise" and the "absolute 
random noise" conditions defined in [22] — see [2]. 
Now, introduce the column vector: 

N 

Zi = co\{cnvt{(j>k t i-i), cnvi((j)2,i-i), CiNve((/)N,i-i)} (560) 

and assume its covariance matrix satisfies 

Z = EziZ* < a-E\\wi-x\\ 2 ■ Q + Z (561) 
for some a > 0, Q > 0, and Z > 0, and where Wi-% is the network block error vector at time i — 1: 

Wi = co\{w iAl Wij,...,Wi,N} (562) 

where 

Wk,i = w° - w k s (563) 

Condition (561) is an extension of (557) to the vector z^. Then, the following result can be established [2]; 
it characterizes the network mean-square deviation in steady-state, which is defined as 

MSD nctwork A ^ , ;E||G M || a I 
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Theorem 10.2. (Mean- Square Stability: Noisy Case) Consider the problem of minimizing the strictly 
convex global cost (54-9), with the individual cost functions {Jk{w)} assumed to be convex with each having a 
minimizer at the same w° . Assume all data are real-valued and suppose the Hessian matrices of the individual 
costs are bounded from below and from above as stated in (550). Assume further that the diffusion strategy 
(54-5)-(54-7) employs noisy gradient vectors, where the noise terms are zero mean and satisfy conditions 
(557) and (561). We select the positive step-sizes to be sufficiently small and to satisfy: 

.max ,mm I /mr\ 

c + «icir< mm ioiqis | (565) 

for k = 1,2, ...,N. Then, the diffusion strategy (545)-(547) is mean-square stable and the mean-square- 
deviation of the network is given by: 

M g D network „ i_ [ yec (^^^T^^^ ■ (J — J 7 ) -1 • ve C (I NM ) (566) 

where 

A 2 = 
M = 
T w 
B = 

n = 

□ 



A 2 ® I M 

diag{/ii7M, ^2hi, 
B T ®B* 

A\ (I - MR) A\ 



N 

^diag 

1=1 



■ , UnIm} 



c eN V 2 w Mw°)} 



(567) 
(568) 
(569) 
(570) 

(571) 
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A Appendix: Properties of Kronecker Products 



For ease of reference, we collect in this appendix some useful properties of Kronecker products. All matrices 
are assumed to be of compatible dimensions; all inverses are assumed to exist whenever necessary. Let 
E = [ey]™j =1 and B = [6ij]Jj = i betixn and m x m matrices, respectively. Their Kronecker product is 
denoted by E ® B and is defined as the nm x nm matrix whose entries are given by [20] : 



E®B 



e\\B ei2-B 
e 2 i-B e 22 B 

e n \B e„2-B 



ei n B 

G rt.m B 



(572) 



In other words, each entry of E is replaced by a scaled multiple of B. Let {Xi(E),i = l,...,n} and 
{Xj(B),j — 1, . . . , m} denote the eigenvalues of E and B, respectively. Then, the eigenvalues of E ® B will 
consist of all nm product combinations {Xi(E)Xj (B)}. Table 9 lists some well-known properties of Kronecker 
products. 



Table 9: Properties of Kronecker products. 



(E + B) ® C = (E ® C) + (B®C) 
(E ® B)(C ®D) = (EC ® BD) 

(E(g>B) T = E T ®B T 
(E <g> B)* =E* ®B* 
(E&B)- 1 = E- 1 ® B' 1 
(E ® B f = E e ®B e 

{X(E®B)} = {X i (E)X j (B)}ZZ=i 
det(E®B) = (detS) m (detB) n 

Tr(E ®B) = Tr(E)Tr(B) 
Ty(EB) = [vec(B T )] T vec(S) 
vec(ECB) = [B T ® E)vec(C) 



B Appendix: Graph Laplacian and Network Connectivity 

Consider a network consisting of N nodes and L edges connecting the nodes to each other. In the construc- 
tions below, we only need to consider the edges that connect distinct nodes to each other; these edges do 
not contain any self-loops that may exist in the graph and which connect nodes to themselves directly. In 
other words, when we refer to the L edges of a graph, we are only excluding self-loops from this set; but we 
are still allowing loops of at least length 2 (i.e., loops generated by paths covering at least 2 edges). 

The neighborhood of any node k is denoted by A4 and it consists of all nodes that node k can share 
information with; these are the nodes that are connected to k through edges, in addition to node k itself. 
The degree of node k, which we denote by nk, is defined as the positive integer that is equal to the size of 
its neighborhood: 

n k = \M k \ (573) 
Since k G A4, we always have n k > 1- We further associate with the network an N x TV Laplacian matrix, 
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denoted by C. The matrix C is symmetric and its entries arc defined as follows [64-66]: 



r n k -i, ak = e 

= < — 1, if k 7^ £ and nodes k and £ are neighbors (574) 
[ 0, otherwise 

Note that the term nt — 1 measures the number of edges that are incident on node k, and the locations 
of the — l's on row k indicate the nodes that are connected to node k. We also associate with the graph 
an N X L incidence matrix, denoted by X. The entries of X are defined as follows. Every column of X 
represents one edge in the graph. Each edge connects two nodes and its column will display two nonzero 
entries at the rows corresponding to these nodes: one entry will be +1 and the other entry will be —1. For 
directed graphs, the choice of which entry is positive or negative can be used to identify the nodes from 
which edges emanate (source nodes) and the nodes at which edges arrive (sink nodes). Since we are dealing 
with undirected graphs, we shall simply assign positive values to lower indexed nodes and negative values to 
higher indexed nodes: 

{+1, if node k is the lower-indexed node connected to edge e 
— 1, if node k is the higher-indexed node connected to edge e (575) 
0, otherwise 

Figure 16 shows the example of a network with N = 6 nodes and L = 8 edges. Its Laplacian and incidence 
matrices are also shown and these have sizes 6x6 and 6x8, respectively. Consider, for example, column 
6 in the incidence matrix. This column corresponds to edge 6, which links nodes 3 and 5. Therefore, at 
location X^q we have a +1 and at location X^q we have —1. The other columns of X are constructed in a 
similar manner. 
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Figure 16: A network with N = 6 nodes and L = 8 edges. The nodes are marked 1 through 6 and the edges are 
marked 1 through 8. The corresponding Laplacian and incidence matrices C and X are 6x6 and 6x8. 



Observe that the Laplacian and incidence matrices of a graph arc related as follows: 

C = XX T (576) 
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The Laplacian matrix conveys useful information about the topology of the graph. The following is a classical 
result from graph theory [64-67]. 

Lemma B.l. (Laplacian and Network Connectivity) Let 

0i>0 2 >...>9 N (577) 
denote the ordered eigenvalues of £. Then the following properties hold: 

(a) £ is symmetric nonnegative- definite so that &i > 0. 

(b) The rows of £ add up to zero so that £t = 0. This means that 1 is a right eigenvector of £ corresponding 
to the eigenvalue zero. 

(c) The smallest eigenvalue is always zero, 6^ — 0. The second smallest eigenvalue, 0n-i, is called the 
algebraic connectivity of the graph. 

(d) The number of times that zero is an eigenvalue of £ (i.e., its multiplicity) is equal to the number of 
connected subgraphs. 

(e) The algebraic connectivity of a connected graph is nonzero, i.e., 9n-i ^ 0. In other words, a graph is 
connected if, and only if, its algebraic connectivity is nonzero. 

Proof. Property (a) follows from the identity L — I X T . Property (b) follows from the definition of £. Note that 
for each row of £, the entries on the row add up to zero. Property (c) follows from properties (a) and (b) since 
£1 — implies that zero is an eigenvalue of £. For part (d), assume the network consists of two separate connected 
subgraphs. Then, the Laplacian matrix would have a block diagonal structure, say, of the form £ — diag{£i, £2}, 
where £\ and £2 are the Laplacian matrices of the smaller subgraphs. The smallest eigenvalue of each of these 
Laplacian matrices would in turn be zero. More generally, if the graph consists of m connected subgraphs, then the 
multiplicity of zero as an eigenvalue of £ would be m. It follows from property (d) that the algebraic connectivity of 
a graph needs to be nonzero in order for the graph to be connected. 

□ 



C Appendix: Stochastic Matrices 

Consider N x N matrices A with nonnegative entries, {aik > 0}. The matrix A = [aik] is said to be 
right-stochastic if it satisfies 

At = t (right-stochastic) (578) 

in which case each row of A adds up to one. The matrix A is said to be left-stochastic if it satisfies 

A T t = 1 (left-stochastic) (579) 

in which case each column of A adds up to one. And the matrix is said to be doubly stochastic if both 
conditions hold so that both its columns and rows add up to one: 

At = 1, A T t = 1 (doubly-stochastic) (580) 

Stochastic matrices arise frequently in the study of adaptation over networks. This appendix lists some of 
their properties. 

Lemma C.l. (Spectral Norm of Stochastic Matrices,) Let A be an N x N right or left or doubly 
stochastic matrix. Then, p(A) = 1 and, therefore, all eigenvalues of A lie inside the unit disc, i.e., |A(j4)| < 1. 
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Proof. We prove the result for right stochastic matrices; a similar argument applies to left or doubly stochastic 
matrices. Let A be a right-stochastic matrix. Then, At = 1, so that A = 1 is one of the eigenvalues of A. Moreover, 
for any matrix A, it holds that p(A) < ||^4||oo, where || ■ ||oo denotes the maximum absolute row sum of its matrix 
argument. But since all rows of A add up to one, we have ||j4||oo = 1. Therefore, p(A) < 1. And since we already 
know that A has an eigenvalue at A = 1, we conclude that p(A) = 1. 

□ 

The above result asserts that the spectral radius of a stochastic matrix is unity and that A has an eigenvalue 
at A = 1. The result, however, does not rule out the possibility of multiple eigenvalues at A = 1, or even 
other eigenvalues with magnitude equal to one. Assume, in addition, that the stochastic matrix A is regular. 
This means that there exists an integer power j a such that all entries of A 3 " are strictly positive, i.e., 

for all (I, k), it holds that [A j °] ek > 0, for some j > (581) 

Then a result in matrix theory known as the Perron-Frobenius Theorem [20] leads to the following stronger 
characterization of the eigen-structure of A. 

Lemma C.2. ('Spectral Norm of Regular Stochastic Matrices,) Let A be an N x N right stochastic 
and regular matrix. Then: 

(a) p{A) = 1. 

(b) All other eigenvalues of A are strictly inside the unit circle (and, hence, have magnitude strictly less 
than one). 

(c) The eigenvalue at A = 1 is simple, i.e., it has multiplicity one. Moreover, with proper sign scaling, all 
entries of the corresponding eigenvector are positive. For a right- stochastic A, this eigenvector is the 
vector t since At = 1. 

(d) All other eigenvectors associated with the other eigenvalues will have at least one negative or complex 
entry. 

Proof. Part (a) follows from Lemma C.l. Parts (b)-(d) follow from the Perron-Frobenius Theorem when A is 
regular [20]. 

□ 



Lemma C.3. (Useful Properties of Doubly Stochastic Matrices ) Let A be an N x N doubly stochastic 
matrix. Then the following properties hold: 

(a) p(A) = 1. 

(b) AA T and A T A are doubly stochastic as well. 

(c) p(AA T ) = p{A T A) = 1. 

(d) The eigenvalues of AA T or A T A are real and lie inside the interval [0, 1]. 

(e) I - AA T > and I- A T A > 0. 

(f) Tr(A T HA) < Tr(H), for any N x N nonnegative- definite Hermitian matrix H. 

Proof. Part (a) follows from Lemma C.l. For part (b), note that AA T is symmetric and AA T 1 = At = 1. Therefore, 
AA T is doubly stochastic. Likewise, for A T A. Part (c) follows from part (a) since AA T and A T A are themselves 
doubly stochastic matrices. For part (d), note that AA T is symmetric and nonnegative-definite. Therefore, its 
eigenvalues are real and nonnegative. But since p(AA T ) — 1, we must have X(AA T ) £ [0, 1]. Likewise, for the matrix 
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A T A. Part (e) follows from part (d). For part (f), since AA T > and its eigenvalues lie within [0, 1], the matrix 
AA T admits an eigen-decomposition of the form: 



AA T = UAU T 

where U is orthogonal (i.e., = U T ) and A is diagonal with entries in the range [0, 1]. It then follows that 

Ti(A T HA) = Ti(AA T H) 

= Ti{UAU T H) 

= Ti(AU T HU) 

(*) „ 

< Tr(U T HU) 

= Tr(UU T H) 

= Tr(H) 

where step (*) is because U T HU = U~ 1 HU and, by similarity, the matrix U~ X HU has the same eigenvalues as H. 
Therefore, U T HU > 0. This means that the diagonal entries of U T HU are nonnegative. Multiplying U T HU by A 
ends up scaling the nonnegative diagonal entries to smaller values so that (*) is justified. 

□ 

D Appendix: Block Maximum Norm 

Let x = coljxi, x%, . . . , xn} denote an N x 1 block column vector whose individual entries are of size M x 1 
each. Following [52,54,55], the block maximum norm of x is denoted by ||a;||b i00 and is defined as 



\x\\b,oo = max \\x k \ 

Kk<N 



(582) 




where || • || denotes the Euclidean norm of its vector argument. Correspondingly, the induced block maximum 
norm of an arbitrary N x N block matrix A, whose individual block entries are of size M x M each, is defined 
as 

(583) 

The block maximum norm inherits the unitary invariance property of the Euclidean norm, as the following 
result indicates [54]. 

Lemma D.l. (Unitary Invariance) Let U — diag{J7i, U2, ■ ■ ■ , Un} be an N x N block diagonal matrix 
with M x M unitary blocks {Uk}- Then, the following properties hold: 

(a) \\Ux\\ bt oo = \\x\\b,oo 

(b) \\UAU*\\ bt00 - Mlkoo 

for all block vectors x and block matrices A of appropriate dimensions. □ 



The next result provides useful bounds for the block maximum norm of a block matrix. 



Lemma D.2. (Useful Bounds) Let A be an arbitrary N x N block matrix with blocks Aik of size M x M 
each. Then, the following results hold: 
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(a) The norms of A and its complex conjugate are related as follows: 

\\A*\\ b ,oo < JV • H^llt.00 (584) 

(b) The norm of A is bounded as follows: 



max \\A ek \\ < \\A\\ bl00 < N-[ max \\A ek \\ ) (585) 

\<l,k<N \l<l,k<N J 

where \\ ■ || denotes the 2— induced norm (or maximum singular value) of its matrix argument. 

(c) If A is Hermitian and nonnegative- definite (A > 0), then there exist finite positive constants c\ and ci 
such that 

ci • Tr(A) < \\A\\ b ,oo < c 2 ■ Tr{A) (586) 
Proof. Part (a) follows directly from part (b) by noting that 

\A*\\b,oo < N ■ ( max 

\Kl,k<N 



= N ■ max \\Aek I 

\l<e,k<N 
< AT -11^116,00 

where the equality in the second step is because ||j4.J fc || = || -4«fc||; i-e., complex conjugation does not alter the 2— induced 
norm of a matrix. 

To establish part (b), we consider arbitrary iVxl block vectors x with entries x = co\{x\,X2, ■ ■ ■ , xn} and where 
each Xk is M x 1. Then, note that 



H-Azlkoo = max 

KKN 



N 

y^^AtkXk 



k=l 
N 



< max \\A lk \\ ■ \\xk\ 

KKN \ ' 
\fe=l 

< max y^H^fell • max ||a; fe || 

\l<*<N<<->" J l<k<N 

/ N \ 

< max max \\Atk\\ ■ MIm 

~ \ KKN £— ^ l<fe<JV J 



< N ■ max \\A ek \\ • Hilli,,, 

\Kf,fc<iV 



so that 



\\A\\ btOB = max W^h22 < N .( max ||^ fe | 
\\x\\ btao \i<e.k<N 

which establishes the upper bound in (585). 

To establish the lower bound, we assume without loss of generality that maxi<e,k<N \\Ae k \\ is attained at I — 1 
and k = 1. Let o\ denote the largest singular value of An and let {vi,ui} denote the corresponding Mxl right and 
left singular vectors. That is, 

||An|| = <ti, An«i=ai«i (587) 
where vi and ui have unit norms. We now construct an N x 1 block vector x° as follows: 

x° = col{«i, Ou, Om, Om} (588) 

Then, obviously, 

||x ||6,oo = 1 (589) 
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and 

It follows that 



Ax" = col{An«i, A21V1, AniV!} (590) 

||-4a; ||b,co = max col{ |[AnVi||, ||A 2 iVi||, ||AjviWi[|} 
> ||An«i|| 
— 

= &i 
= \\Au\\ 

max \\AtkW (591) 

Kl,k<N " " V ' 



Therefore, by the definition of the block maximum norm, 



\\A\\ h ^ = max^M^ 



\ \\x\\ b , a 

,00 

ll^lkoo 

11*4.37 ||t*,oo 

> max 1 1 A tt 1 1 (592) 

l<£,k<N 

which establishes the lower bound in (585). 

To establish part (c), we start by recalling that all norms on finite-dimensional vector spaces are equivalent [20,21]. 
This means that if || • \\ a and |j • ||d denote two different matrix norms, then there exist positive constants ci and C2 
such that for any matrix X, 

ci • \\X\\ a < \\X\\ d < c 2 ■ ||X|| a (593) 
Now, let denote the nuclear norm of the square matrix A. It is defined as the sum of its singular values: 

11-411* = $>m(^) (594) 

m 

Since A is Hermitian and nonnegative-definite, its eigenvalues coincide with its singular values and, therefore, 

\\A\U = *£\ m (A) = Tr(A) 

m 

Now applying result (593) to the two norms ||.4||i,,oo and \\A\\ t we conclude that 

d-Ti(A) < \\A\\ b ,oo < c 2 -Tr(A) (595) 
as desired. □ 

The next result relates the block maximum norm of an extended matrix to the 00— norm (i.e., maximum 
absolute row sum) of the originating matrix. Specifically, let A be an N x N matrix with bounded entries 
and introduce the block matrix 

A = A®I M (596) 
The extended matrix A has blocks of size M x M each. 

Lemma D.3. (Relation to Maximum Absolute Row Sum) Let A and A be related as in (596). Then, 
the following properties hold: 

(a) ||^4.|| i.,00 = Halloo? where the notation || ■ denotes the maximum absolute row sum of its argument. 

(b) M*|kao < N-\\A\\ b<00 . 
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Proof. The results are obvious for a zero matrix A. So assume A is nonzero. Let x = col{a;i, X2, ■ • ■ , xn} denote an 
arbitrary iV x 1 block vector whose individual entries {x^} axe vectors of size M X 1 each. Then, 



||Ae||&,c 



max 

Kk<N 



N 



so that 



Pike 



< max \a k i\ ■ \\x e \\ 

l<k<N \~—^ J 



max \aki\ ■ max \\xt\ 

Kk<N *-~> I KKN 

- - 1=1 

|A|U • ||x|| 6l00 



= maxM^ < ||A|| C 
a; 6,oo 



(597) 



(598) 



The argument so far establishes that ||„4|koo < ||^4||oo- Now, let k denote the row index that corresponds to the 
maximum absolute row sum of A, i.e., 

N 

1=1 

We construct an N x 1 block vector z — col{2i, 22, ... , zn}, whose Mxl entries {zi} are chosen as follows: 

ze = sign(a fco £) • ei 

where ei is the Mxl basis vector: 

ei = col{l,0,0, ... ,0} 

and the sign function is defined as 

sign(a) = 

Then, note that 2 7^ for any nonzero matrix A, and 



1, ifa>0 
-1, otherwise 



IMk 



max \\zA\ = 1 
KKN 11 11 



Moreover, 



||-4|koo = max 



\\Ax\\ 



> 



^0 \\x\\ h ,, 
\\Az\\ bt00 

\\z\\b,oo 
\\Az\\b,oo 



max 

Kfe<iV 



N 

^2 akeze 



N 

^ a.k a izi 
1=1 

N 

^2, ak oi ■ sign(a feof )ei 
1=1 

£K<Hki|| 

1=1 

N 

\ ak ° i \ 



\\A\\ 



(599) 
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Combining this result with (598) we conclude that ||^4||i,,oo = Halloo, which establishes part (a). Part (b) follows from 
the statement of part (a) in Lemma D.2. 

□ 

The next result establishes a useful property for the block maximum norm of right or left stochastic matrices; 
such matrices arise as combination matrices for distributed processing over networks as in (166) and (185). 



Lemma D.4. (Right and Left Stochastic Matrices) Let C be an N x N right stochastic matrix, i.e., 
its entries are nonnegative and it satisfies CT = 1. Let A be an N x N left stochastic matrix, i.e., its entries 
are nonnegative and it satisfies A T \ — 1. Introduce the block matrices 



A T = A T ®I M , C = C®I M 
The matrices A and C have blocks of size M x M each. It holds that 



Proof. Since A T and C are right stochastic matrices, it holds that WA 1 
then follows from part (a) of Lemma D.3. 



1 and HCII 



(600) 



(601) 



1. The desired result 



□ 



The next two results establish useful properties for the block maximum norm of a block diagonal matrix 
transformed by stochastic matrices; such transformations arise as coefficient matrices that control the evo- 
lution of weight error vectors over networks, as in (189). 



Lemma D.5. (Block Diagonal Hermitian Matrices) Consider an N x TV block diagonal Hermitian 
matrix T> = diag{£>i, D2, . . . , Djy}, where each D k is M x M Hermitian. It holds that 



piV) = max p{D k ) = ||2?|| 6 .oo 

Kk<N 



(602) 



where p(-) denotes the spectral radius (largest eigenvalue magnitude) of its argument. That is, the spectral 
radius of T> agrees with the block maximum norm of T>, which in turn agrees with the largest spectral radius 
of its block components. 

Proof. We already know that the spectral radius of any matrix X satisfies p(X) < \\X\\, for any induced matrix 
norm [19,20]. Applying this result to T> we readily get that p(T>) < ||2?||& )0 o. We now establish the reverse inequality, 
namely, ||2?||j, j00 < p(T>). Thus, pick an arbitrary N x 1 block vector x with entries {xi,X2, ■ ■ ■ ,%n}, where each Xk 
is M x 1. From definition (583) we have 



W\ 



b.oc — 



max ■ 



max 



max 



X b . 



1 



\\x\\b,c 
1 

\\x\\b,c 



max 

l<*<iV 



max (H-DjJ 

Kk<N 



Ml) 



max max 

x^tO l<k<N 

max \\D k \\ 

l<k<N 

max p(Dk) 

Kk<N 



\D k 



\x\\b,oc 



(603) 



where the notation \\Dk \\ denotes the 2— induced norm of D k (i.e., its largest singular value). But since D k is assumed 
to be Hermitian, its 2— induced norm agrees with its spectral radius, which explains the last equality. 

□ 
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Lemma D.6. (Block Diagonal Matrix Transformed by Left Stochastic Matrices) Consider an 
TV x TV block diagonal Hermitian matrix T> = diag{Z?i, D2, ■ ■ . , D^}, where each is M x M Hermitian. 
Let A\ and A2 be N x TV left stochastic matrices, i.e., their entries are nonnegative and they satisfy Ajt = 1 



and A\\ = 1. Introduce the block matrices 



Aj = A\ ® I M , Al = Al <g) I M 
The matrices A\ and A2 have blocks of size M x M each. Then it holds that 



p{Al-V-A\) < P (V) 



(604) 



(605) 



Proof. Since the spectral radius of any matrix never exceeds any induced norm of the same matrix, we have that 



< 
< 

(601) 
(602) 



Ai-V-Al 



Ai 



W\\ 



Pip) 



WW 



AT 



In view of the result of Lemma D.5, we also conclude from (605) that 



p(Al-V-A\) < max . p(D k ) 

±<K<r\ 



(606) 
□ 



(607) 



It is worth noting that there are choices for the matrices {Ai, A2,T>} that would result in strict inequality 
in (605). Indeed, consider the special case: 

1 2 



V 



2 
1 



A{ 



Al 



This case corresponds to TV = 2 and M = 1 (scalar blocks). Then, 

A\VA\ = 

and it is easy to verify that 



1.52 



p(V) = 2, p{AlVAl) 

The following conclusions follow as corollaries to the statement of Lemma D.6, where by a stable matrix X 
we mean one whose eigenvalues lie strictly inside the unit circle. 

Corollary D.l. (Stability Properties) Under the same setting of Lemma D.6, the following conclusions 
hold: 

(a) The matrix A^T>A^ is stable whenever T> is stable. 

(b) The matrix A^DA^ is stable for all possible choices of left stochastic matrices A\ and A2 if, and only 
if, T> is stable. 

Proof. Since D is block diagonal, part (a) follows immediately from (605) by noting that p(T>) < 1 whenever T> is 
stable. [This statement hxes the argument that appeared in App. I of [18] and Lemma 2 of [33]. Since the matrix X 
in App. I of [18] and the matrix M in Lemma 2 of [33] are block diagonal, the || • ||i,,oo norm should simply replace 
the || ■ \\ p norm used there, as in the proof that led to (606) and as already done in [54].] For part (b), assume first 
that T> is stable, then A%T>A[ will also be stable by part (a) for any left-stochastic matrices Ai and Ai- To prove 
the converse, assume that A^T>A{ is stable for any choice of left stochastic matrices and A2- Then, A2DAT is 
stable for the particular choice Ai — I — Ai and it follows that T> must be stable. 



□ 
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E Appendix: Comparison with Consensus Strategies 



Consider a connected network consisting of N nodes. Each node has a state or measurement value x^, 
possibly a vector of size Mxl. All nodes in the network are interested in evaluating the average value of 
their states, which we denote by 



1 N 

-Y 



(608) 



k=l 



A centralized solution to this problem would require each node to transmit its measurement Xk to a fusion 
center. The central processor would then compute w° using (608) and transmit it back to all nodes. This 
centralized mode of operation suffers from at least two limitations. First, it requires communications and 
power resources to transmit the data back and forth between the nodes and the central processor; this 
problem is compounded if the fusion center is stationed at a remote location. Second, the architecture has a 
critical point of failure represented by the central processor; if it fails, then operations would need to be halted. 

Consensus Recursion 

The consensus strategy provides an elegant distributed solution to the same problem, whereby nodes interact 
locally only with their neighbors and arc able to converge to w° through these interactions. Thus, consider 
an arbitrary node k and assign nonnegative weights {aik} to the edges linking k to its neighbors I £ Afk- 
For each node k 7 the weights {a^} are assumed to add up to one so that 



forfc = 1,2,...,JV: 
atk > 0, ^2 atk = 1, aik = if i Afk 



N 



(609) 



The resulting combination matrix is denoted by A and its k— th column consists of the entries {aik,t = 
1,2,..., TV}. In view of (609), the combination matrix A is seen to satisfy A T t = 1. That is, A is left- 
stochastic. The consensus strategy can be described as follows. Each node k operates repeatedly on the data 
from its neighbors and updates its state iteratively according to the rule: 



Wk,n = ^2 aik w t-,n-l, Tl > 



(610) 



where w<>,„_i denotes the state of node I at iteration n — 1, and Wk, n denotes the updated state of node k 
after iteration n. The initial conditions are 



w k ,o = %k, k = 1,2, ...,N 
If we collect the states of all nodes at iteration n into a column vector, say, 

Z n = C0l{tUi, n , V)2, n , •■•,WN,n} 

Then, the consensus iteration (610) can be equivalently rewritten in vector form as follows: 



Zn = A T z„-x, n > 



where 

The initial condition is 



A T = A T ® I M 



A 



col{xi, x 2l ■ ■ -,x N } 



(611) 
(612) 

(613) 
(614) 

(615) 
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Error Recursion 



Note that we can express the average value, w°, from (608) in the form 



w° = — ■ (t 1 ®I M ) ■ z 



where 1 is the vector of size M X 1 and whose entries are all equal to one. Let 

Wk.n =W° — Wk,n 



(616) 



(617) 



denote the weight error vector for node k at iteration n; it measures how far the iterated state is from the 
desired average value w°. We collect all error vectors across the network into an N x 1 block column vector 
whose entries are of size Mxl each: 



A 



W 2 ,n 



WN,n 



Then. 



w n = (1 <8> Im)w° - z n 



(618) 



(619) 



Convergence Conditions 

The following result is a classical result on consensus strategies [42-44]. It provides conditions under which 
the state of all nodes will converge to the desired average, w°, so that w„ will tend to zero. 

Theorem E.l. ("Convergence to Consensus J For any initial states {xk}, the successive iterates Wk. n 
generated by the consensus iteration (610) converge to the network average value w° as n — > oo if, and only 
if, the following three conditions are met: 

A T l = 1 (620) 
At = I (621) 

p(V-lll T ) < 1 (622) 

That is, the combination matrix A needs to be doubly stochastic, and the matrix A T — j^tt T needs to be 
stable. 

Proof. (Sufficiency) . Assume first that the three conditions stated in the theorem hold. Since A is doubly stochastic, 
then so is any power of A, say, A n for any n > 0, so that 

[A n ] T l = l, A n t = l (623) 

Using this fact, it is straightforward to verify by induction the validity of the following equality: 

1 \ n 1 
A T --^U T ) = [A n ] T -^U T (624) 

Likewise, using the Kronecker product identities 

(E + B)®C = (£®C) + (B®C) (625) 
(E®B)(C®D) = (EC <g> BD) (626) 
(E (g> B) n = E n ®B n (627) 
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for matrices {E, B, C, D} of compatible dimensions, we observe that 



(A n f ~ ^ • (1 ® /m) • (1 T ® /a/) = 



(H24) 



[(A" 



_1 



) Im 



Iterating (613) we find that 
and, hence, from (616) and (619), 



— [A ] Zo 



(628) 
(629) 



(A n ) T 



4 • (1 ® /a/) ■ (1 T ® Im) 



(028) 



(630) 



Now recall that, for two arbitrary matrices C and D of compatible dimensions, the eigenvalues of the Kronecker 
product C £g> D is formed of all product combinations \i(C)\j(D) of the eigenvalues of C and D [19]. We conclude 
from this property, and from the fact that A T — ill T is stable, that the coefficient matrix 



A 1 



is also stable. Therefore, 

w n — > asn-)oo (631) 
(Necessity). In order for 2„ in (629) to converge to (1 ® Im)w°, for any initial state z , it must hold that 

, T 



lim (A n f -Zo = ^ • (1 ® /m) ■ (1"' O /a/) • z„ 



for any z . This implies that we must have 
or, equivalently, 

This in turn implies that we must have 



lim (A" 



jr ■ (11 t ®/m) 



lim (A n f = ^11 T 



lim A T ■ (A n ) T = A T 



N~ 



-11 J 



But since 



lim A T ■ (A n ) T 

n — S-oo 



lim (A n+1 ) 



lim (A ri 



we conclude from (634) and (635) that it must hold that 



4-ll T = ^-A T -tl T 

N N 



That is, 



1_ 

N 



A 1 ! - 1] • 1 J = 



(632) 

(633) 
(634) 

(635) 
(636) 

(637) 
(638) 



from which we conclude that we must have A T \ — 1. Similarly, we can show that Al — 1 by studying the limit 
of (A n ) T A T . Therefore, A must be a doubly stochastic matrix. Now using the fact that A is doubly stochastic, we 
know that (624) holds. It follows that in order for condition (634) to be satisfied, we must have 

P (a t - ^U T ) < 1 (639) 

□ 
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Rate of Convergence 

From (630) we conclude that the rate of convergence of the error vectors {wk,n} to zero is determined by 
the spectrum of the matrix 

A T - j^tt T (640) 

Now since A is a doubly stochastic matrix, we know that it has an eigenvalue at A = 1. Let us denote the 
eigenvalues of A by Xk{A) and let us order them in terms of their magnitudes as follows: 

0< |A M (A)| < ... < |A 3 (A)| < \X 2 (A)\ < 1 (641) 

where Xi(A) = 1. Then, the eigenvalues of the coefficient matrix (A T — -^11 T ) are equal to 

{X M (A),,..., X 3 (A), X 2 (A),0} (642) 

It follows that the magnitude of X2{A) becomes the spectral radius of A T — ^11 T . Then condition (639) 
ensures that |Aa(yl)| < 1. We therefore arrive at the following conclusion. 

Corollary E.l. (Rate of Convergence of Consensus,) Under conditions (620)-(622), the rate of con- 
vergence of the successive iterates {wk, n } towards the network average w° in the consensus strategy (610) is 
determined by the second largest eigenvalue magnitude of A, i.e., by | A2 (^4) | as defined in (641). 

□ 

It is worth noting that doubly stochastic matrices A that are also regular satisfy conditions (620)-(622). 
This is because, as we already know from Lemma C.2, the eigenvalues of such matrices satisfy |A rn (A)| < 1, 
for m = 2, 3, . . . , N, so that condition (622) is automatically satisfied. 

Corollary E.2. (Convergence for Regular Combination Matrices,) Any doubly- stochastic and regular 
matrix A satisfies the three conditions (620)-(622) and, therefore, ensures the convergence of the consensus 
iterates {wk, n } generated by (610) towards w° as n — > 00. 

□ 

A regular combination matrix A would result when the two conditions listed below are satisfied by the graph 
connecting the nodes over which the consensus iteration is applied. 

Corollary E.3. ('Sufficient Condition for Regularity,) Assume the combination matrix A is doubly 
stochastic and that the graph over which the consensus iteration (610) is applied satisfies the following two 
conditions: 

(a) The graph is connected. This means that there exists a path connecting any two arbitrary nodes in the 
network. In terms of the Laplacian matrix that is associated with the graph (see Lemma B.l), this 
means that the second smallest eigenvalue of the Laplacian is nonzero. 

(b ) aek = if, and only if, £ ^ A4 . That is, the combination weights are strictly positive between any two 
neighbors, including auk > 0. 

Then, the corresponding matrix A will be regular and, therefore, the consensus iterates {wk, n } generated by 
(610) will converge towards w° as n — > 00. 

Proof. We first establish that conditions (a) and (b) imply that A is a regular matrix, namely, that there should 
exist an integer j > such that 

\a 3 °] > (643) 

L J Ik 

for all (£, k). To begin with, by the rules of matrix multiplication, the (£, k) entry of the i— th power of A is given by: 

N N N 

A'j = ^2 ^2 ■■■ a tmi a mim2 . . .a mi _ lk (644) 

mj-1 TTI2 — 1 ?n,j_ 1 =1 
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The summand in (644) is nonzero if, and only if, there is some sequence of indices (£, mi, . . . , m,_i, k) that forms 
a path from node £ to node k. Since the network is assumed to be connected, there exists a minimum (and finite) 
integer value iik such that a path exists from node I to node k using iik edges and that 



\A Hk \ > 
L iik 



In addition, by induction, if [A Hk ] ek > 0, then 



[^ +1 L - E 

> [ AHk ] lk a kk 



O-mk 



Let 



> 



j D = max {i ek } 

Kk,KN 



Then, property (643) holds for all (£, k). And we conclude from (581) that A is a regular matrix. It then follows from 
Corollary E.2 that the consensus iterates {wk,n} converge to the average network value w° . 

□ 



Comparison with Diffusion Strategies 

Observe that in comparison to diffusion strategies, such as the ATC strategy (153), the consensus iteration 
(610) employs the same quantities w k> . on both sides of the iteration. In other words, the consensus con- 
struction keeps iterating on the same set of vectors until they converge to the average value w° . Moreover, 
the index n in the consensus algorithm is an iteration index. In contrast, diffusion strategics employ different 
quantities on both sides of the combination step in (153), namely, w k> i and {il>e,i}] the latter variables have 
been processed through an information exchange step and arc updated (or filtered) versions of the we,i~i- In 
addition, each step of the diffusion strategy (153) can incorporate new data, {dg(i), it^i}, that are collected 
by the nodes at every time instant. In this way, diffusion strategies are able to incorporate new information 
during each update. Moreover, the index i in the diffusion implementation is a time index (and not an iter- 
ation index); this is because diffusion strategies are inherently adaptive and perform online learning. Data 
keeps streaming in and diffusion incorporates the new data into the update equations at every time instant. 
As a result, diffusion strategies are able to respond to data in an adaptive manner, and they are also able 
to solve general optimization problems: the vector w° in adaptive diffusion iterations is the minimizcr of a 
global cost function (cf. (92)), while the vector w° in consensus iterations is the average value of the initial 
states of the nodes (cf. (608)). 

Moreover, it turns out that diffusion strategics influence the evolution of the network dynamics in an 
interesting and advantageous manner in comparison to consensus strategics. Wc illustrate this point by 
means of an example. Consider initially the ATC strategy (158) without information exchange, whose 
update equation we repeat below for ease of reference: 

■0fc,i = Wk,i-i + UkU* k>i [d k (i) - u k)i w k ,i-i] (645) 
w k)i = a *k i>i,i (ATC diffusion ) (646) 

These recursions were derived in the body of the article as an effective distributed solution for optimizing 
(92)-(93). Note that they involve two steps, where the weight estimator w k ,i-i is first updated to the 
intermediate estimator ip k i , before the intermediate estimators from across the neighborhood are combined 
to obtain w k> i. Both steps of ATC diffusion (645)-(646) can be combined into a single update as follows: 



W k .i = E aek w £,i-i ' 


\- 2J V-ia-ikULi [di{i) - Ui.iWz^x] 


(ATC diffusion) 


(647) 
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Likewise, consider the CTA strategy (159) without information exchange, whose update equation we also 
repeat below: 



4, 



k,i-l 



E 



(CTA diffusion ) 



Wk,i = V'fc.i-i + Mfc*4,i [d k (i) - «fc,iV'fc,i-i] 



(648) 
(649) 



Again, the CTA strategy involves two steps: the weight estimators from the neighborhood of node 

k arc first combined to yield the intermediate estimator xp k which is subsequently updated to w^.i- Both 
steps of CTA diffusion can also be combined into a single update as follows: 




(CTA diffusion) 



(650) 



Now, motivated by the consensus iteration (610), and based on a procedure for distributed optimization 
suggested in [52] (see expression (7.1) in that reference), several works in the literature (e.g., [45,53,82-86]) 
have regularly considered distributed strategies that correspond to the following form for the optimization 
problem under consideration (see, e.g., expression (1.20) in [53] and expression (9) in [85]): 



w k . 



} J aek We,i-i 



(consensus strategy) 



(651) 



This strategy can be derived by following the same argument we employed earlier in Sees. 3.2 and 4 to 
arrive at the diffusion strategies, namely, we replace w° in (127) by wi^-\ and then apply the instantaneous 
approximations (150). Note that the same variable w^,- appears on both sides of the equality in (651). 
Thus, compared with the ATC diffusion strategy (647), the update from w^.i-i to Wk,% in the consensus 
implementation (651) is only influenced by data {dk(i), Uk,i} from node k. In contrast, the ATC diffusion 
structure (645)-(646) helps incorporate the influence of the data {dt{i),ui,i\ from across the neighborhood 
of node k into the update of tUfc^, since these data are reflected in the intermediate estimators i }. 
Likewise, the contrast with the CTA diffusion strategy (650) is clear, where the right-most term in (650) 
relies on a combination of all estimators from across the neighborhood of node k, and not only on Wk,i-i as 
in the consensus strategy (651). These facts have desirable implications on the evolution of the weight-error 
vectors across diffusion networks. Some simple algebra, similar to what we did in Sec. 6, will show that the 
mean of the extended error vector for the consensus strategy (651) evolves according to the recursion: 



Ewi = (A T -MTZ U ) -Efiji-i, i>0 



(consensus strategy) 



(652) 



where 1Z U is the block diagonal covariance matrix defined by (184) and Wi is the aggregate error vector 
defined by (230). We now compare the above mean error dynamics with the ones that correspond to the 
ATC and CTA diffusion strategies (645)-(646) and (648)-(650); their error dynamics follow as special cases 
from (248) by setting Ai = I = C and A 2 = A for ATC and A 2 = I = C and A X =A for CTA: 





Ew l 


= A T {I NM - MK U ) 


■ Eu>i_i, 


i > 


and 












Euii 


= (Inm - MR*) A T 


• Ews_i, 


i > 



(ATC diffusion) 



(CTA diffusion) 



(653) 



(654) 
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We observe that the coefficient matrices that control the evolution of Eio.; are different in all three cases. In 
particular, 



It follows that the mean stability of the consensus network is sensitive to the choice of the combination 
matrix A. This is not the case for the diffusion strategies. This is because from property (605) established in 
App. D, we know that the matrices A T {Inm ~ MTZ U ) and (Inm — M1Z U ) A T arc stable if (Inm — M1Z U ) 
is stable. Therefore, we can select the step-sizes to satisfy Hk < 2/A max (-Ru,fc) for the ATC or CTA diffusion 
strategies and ensure their mean stability regardless of the combination matrix A. This also means that 
the diffusion networks will be mean stable whenever the individual nodes are mean stable, regardless of the 
topology defined by A. In contrast, for consensus networks, the network can exhibit unstable mean behavior 
even if all its individual nodes are stable in the mean. For further details on the mean-square performance 
of diffusion networks in relation to consensus networks, the reader may refer to [87]. 
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